
Data-Intensive Text Processing
with MapReduce
Jimmy Lin and Chris Dyer
University of Maryland, College Park
Manuscript prepared April 11, 2010
This is the pre-production manuscript of a book in the Morgan & Claypool Synthesis
Lectures on Human Language Technologies. Anticipated publication date is mid-2010.
Contents

1 Introduction
  1.1 Computing in the Clouds
  1.2 Big Ideas
  1.3 Why Is This Different?
  1.4 What This Book Is Not

2 MapReduce Basics
  2.1 Functional Programming Roots
  2.2 Mappers and Reducers
  2.3 The Execution Framework
  2.4 Partitioners and Combiners
  2.5 The Distributed File System
  2.6 Hadoop Cluster Architecture
  2.7 Summary

3 MapReduce Algorithm Design
  3.1 Local Aggregation
    3.1.1 Combiners and In-Mapper Combining
    3.1.2 Algorithmic Correctness with Local Aggregation
  3.2 Pairs and Stripes
  3.3 Computing Relative Frequencies
  3.4 Secondary Sorting
  3.5 Relational Joins
    3.5.1 Reduce-Side Join
    3.5.2 Map-Side Join
    3.5.3 Memory-Backed Join
  3.6 Summary

4 Inverted Indexing for Text Retrieval
  4.1 Web Crawling
  4.2 Inverted Indexes
  4.3 Inverted Indexing: Baseline Implementation
  4.4 Inverted Indexing: Revised Implementation
  4.5 Index Compression
    4.5.1 Byte-Aligned and Word-Aligned Codes
    4.5.2 Bit-Aligned Codes
    4.5.3 Postings Compression
  4.6 What About Retrieval?
  4.7 Summary and Additional Readings

5 Graph Algorithms
  5.1 Graph Representations
  5.2 Parallel Breadth-First Search
  5.3 PageRank
  5.4 Issues with Graph Processing
  5.5 Summary and Additional Readings

6 EM Algorithms for Text Processing
  6.1 Expectation Maximization
    6.1.1 Maximum Likelihood Estimation
    6.1.2 A Latent Variable Marble Game
    6.1.3 MLE with Latent Variables
    6.1.4 Expectation Maximization
    6.1.5 An EM Example
  6.2 Hidden Markov Models
    6.2.1 Three Questions for Hidden Markov Models
    6.2.2 The Forward Algorithm
    6.2.3 The Viterbi Algorithm
    6.2.4 Parameter Estimation for HMMs
    6.2.5 Forward-Backward Training: Summary
  6.3 EM in MapReduce
    6.3.1 HMM Training in MapReduce
  6.4 Case Study: Word Alignment for Statistical Machine Translation
    6.4.1 Statistical Phrase-Based Translation
    6.4.2 Brief Digression: Language Modeling with MapReduce
    6.4.3 Word Alignment
    6.4.4 Experiments
  6.5 EM-Like Algorithms
    6.5.1 Gradient-Based Optimization and Log-Linear Models
  6.6 Summary and Additional Readings

7 Closing Remarks
  7.1 Limitations of MapReduce
  7.2 Alternative Computing Paradigms
  7.3 MapReduce and Beyond
CHAPTER 1
Introduction
MapReduce [45] is a programming model for expressing distributed computations on
massive amounts of data and an execution framework for large-scale data processing
on clusters of commodity servers. It was originally developed by Google and built on
well-known principles in parallel and distributed processing dating back several decades.
MapReduce has since enjoyed widespread adoption via an open-source implementation
called Hadoop, whose development was led by Yahoo (now an Apache project). Today,
a vibrant software ecosystem has sprung up around Hadoop, with significant activity
in both industry and academia.
This book is about scalable approaches to processing large amounts of text with
MapReduce. Given this focus, it makes sense to start with the most basic question:
Why? There are many answers to this question, but we focus on two. First, “big data”
is a fact of the world, and therefore an issue that real-world systems must grapple with.
Second, across a wide range of text processing applications, more data translates into
more effective algorithms, and thus it makes sense to take advantage of the plentiful
amounts of data that surround us.
Modern information societies are defined by vast repositories of data, both public
and private. Therefore, any practical application must be able to scale up to datasets
of interest. For many, this means scaling up to the web, or at least a non-trivial frac-
tion thereof. Any organization built around gathering, analyzing, monitoring, filtering,
searching, or organizing web content must tackle large-data problems: “web-scale” pro-
cessing is practically synonymous with data-intensive processing. This observation ap-
plies not only to well-established internet companies, but also to countless startups and
niche players. Just think, how many companies do you know that start their
pitch with “we’re going to harvest information on the web and...”?
Another strong area of growth is the analysis of user behavior data. Any operator
of a moderately successful website can record user activity and in a matter of weeks (or
sooner) be drowning in a torrent of log data. In fact, logging user behavior generates
so much data that many organizations simply can’t cope with the volume, and either
turn the functionality off or throw away data after some time. This represents lost
opportunities, as there is a broadly-held belief that great value lies in insights derived
from mining such data. Knowing what users look at, what they click on, how much
time they spend on a web page, etc. leads to better business decisions and competitive
advantages. Broadly, this is known as business intelligence, which encompasses a wide
range of technologies including data warehousing, data mining, and analytics.
How much data are we talking about? A few examples: Google grew from pro-
cessing 100 TB of data a day with MapReduce in 2004 [45] to processing 20 PB a day
with MapReduce in 2008 [46]. In April 2009, a blog post (http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/) was written about eBay’s
two enormous data warehouses: one with 2 petabytes of user data, and the other with
6.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new
records per day. Shortly thereafter, Facebook revealed similarly impressive numbers (http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/),
boasting of 2.5 petabytes of user data, growing at about 15 terabytes per day. Petabyte
datasets are rapidly becoming the norm, and the trends are clear: our ability to store
data is fast overwhelming our ability to process what we store. More distressing, in-
creases in capacity are outpacing improvements in bandwidth such that our ability to
even read back what we store is deteriorating [91]. Disk capacities have grown from tens
of megabytes in the mid-1980s to about a couple of terabytes today (several orders of
magnitude). On the other hand, latency and bandwidth have improved relatively little:
in the case of latency, perhaps a 2× improvement during the last quarter century, and
in the case of bandwidth, perhaps 50×. Given the tendency for individuals and organi-
zations to continuously fill up whatever capacity is available, large-data problems are
growing increasingly severe.
Moving beyond the commercial sphere, many have recognized the importance of
data management in many scientific disciplines, where petabyte-scale datasets are also
becoming increasingly common [21]. For example:
• The high-energy physics community was already describing experiences with
petabyte-scale databases back in 2005 [20]. Today, the Large Hadron Collider
(LHC) near Geneva is the world’s largest particle accelerator, designed to probe
the mysteries of the universe, including the fundamental nature of matter, by
recreating conditions shortly following the Big Bang. When it becomes fully op-
erational, the LHC will produce roughly 15 petabytes of data a year (http://public.web.cern.ch/public/en/LHC/Computing-en.html).
• Astronomers have long recognized the importance of a “digital observatory” that
would support the data needs of researchers across the globe—the Sloan Digital
Sky Survey [145] is perhaps the most well known of these projects. Looking into
the future, the Large Synoptic Survey Telescope (LSST) is a wide-field instrument
that is capable of observing the entire sky every few days. When the telescope
comes online around 2015 in Chile, its 3.2 gigapixel primary camera will produce
approximately half a petabyte of archive images every month [19].
• The advent of next-generation DNA sequencing technology has created a deluge
of sequence data that needs to be stored, organized, and delivered to scientists for
further study. Given the fundamental tenet in modern genetics that genotypes
explain phenotypes, the impact of this technology is nothing less than transfor-
mative [103]. The European Bioinformatics Institute (EBI), which hosts a central
repository of sequence data called EMBL-bank, has increased storage capacity
from 2.5 petabytes in 2008 to 5 petabytes in 2009 [142]. Scientists are predicting
that, in the not-so-distant future, sequencing an individual’s genome will be no
more complex than getting a blood test today—ushering in a new era of personalized
medicine, where interventions can be specifically targeted for an individual.
Increasingly, scientific breakthroughs will be powered by advanced computing capabil-
ities that help researchers manipulate, explore, and mine massive datasets [72]—this
has been hailed as the emerging “fourth paradigm” of science [73] (complementing the-
ory, experiments, and simulations). In other areas of academia, particularly computer
science, systems and algorithms incapable of scaling to massive real-world datasets run
the danger of being dismissed as “toy systems” with limited utility. Large data is a fact
of today’s world and data-intensive processing is fast becoming a necessity, not merely
a luxury or curiosity.
Although large data comes in a variety of forms, this book is primarily concerned
with processing large amounts of text, but touches on other types of data as well (e.g.,
relational and graph data). The problems and solutions we discuss mostly fall into the
disciplinary boundaries of natural language processing (NLP) and information retrieval
(IR). Recent work in these fields is dominated by a data-driven, empirical approach,
typically involving algorithms that attempt to capture statistical regularities in data for
the purposes of some task or application. There are three components to this approach:
data, representations of the data, and some method for capturing regularities in the
data. Data are called corpora (singular, corpus) by NLP researchers and collections by
those from the IR community. Aspects of the representations of the data are called fea-
tures, which may be “superficial” and easy to extract, such as the words and sequences
of words themselves, or “deep” and more difficult to extract, such as the grammatical
relationship between words. Finally, algorithms or models are applied to capture regu-
larities in the data in terms of the extracted features for some application. One common
application, classification, is to sort text into categories. Examples include: Is this email
spam or not spam? Is this word part of an address or a location? The first task is
easy to understand, while the second task is an instance of what NLP researchers call
named-entity detection [138], which is useful for local search and pinpointing locations
on maps. Another common application is to rank texts according to some criteria—
search is a good example, which involves ranking documents by relevance to the user’s
query. Another example is to automatically situate texts along a scale of “happiness”,
a task known as sentiment analysis or opinion mining [118], which has been applied to
everything from understanding political discourse in the blogosphere to predicting the
movement of stock prices.
There is a growing body of evidence, at least in text processing, that of the three
components discussed above (data, features, algorithms), data probably matters the
most. Superficial word-level features coupled with simple models in most cases trump
sophisticated models over deeper features and less data. But why can’t we have our cake
and eat it too? Why not both sophisticated models and deep features applied to lots of
data? Because inference over sophisticated models and extraction of deep features are
often computationally intensive, they don’t scale well.
Consider a simple task such as determining the correct usage of easily confusable
words such as “than” and “then” in English. One can view this as a supervised machine
learning problem: we can train a classifier to disambiguate between the options, and
then apply the classifier to new instances of the problem (say, as part of a grammar
checker). Training data is fairly easy to come by—we can just gather a large corpus of
texts and assume that most writers make correct choices (the training data may be noisy,
since people make mistakes, but no matter). In 2001, Banko and Brill [14] published
what has become a classic paper in natural language processing exploring the effects
of training data size on classification accuracy, using this task as the specific example.
They explored several classification algorithms (the exact ones aren’t important, as we
shall see), and not surprisingly, found that more data led to better accuracy. Across
many different algorithms, the increase in accuracy was approximately linear in the
log of the size of the training data. Furthermore, with increasing amounts of training
data, the accuracy of different algorithms converged, such that pronounced differences
in effectiveness observed on smaller datasets basically disappeared at scale. This led to
a somewhat controversial conclusion (at least at the time): machine learning algorithms
really don’t matter, all that matters is the amount of data you have. This resulted in
an even more controversial recommendation, delivered somewhat tongue-in-cheek: we
should just give up working on algorithms and simply spend our time gathering data
(while waiting for computers to become faster so we can process the data).
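To make the supervised formulation concrete, the sketch below (a minimal illustration of our own, not the setup used by Banko and Brill; the toy training sentences, the two-word context features, and the add-one smoothing are simplifying assumptions) trains a count-based classifier that chooses between “than” and “then” from the immediately neighboring words:

```python
from collections import Counter, defaultdict

# Toy training sentences; in practice this would be a large corpus of
# (mostly) correctly written text, as described above.
TRAINING = [
    "bigger than a house",
    "faster than light travel",
    "finish work and then relax",
    "we ate and then we left",
]

def context_features(tokens, i):
    """Features: the words immediately before and after position i."""
    prev_w = tokens[i - 1] if i > 0 else "<s>"
    next_w = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return [("prev", prev_w), ("next", next_w)]

# Count how often each context feature co-occurs with each label.
feature_counts = defaultdict(Counter)
label_counts = Counter()
for sentence in TRAINING:
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        if token in ("than", "then"):
            label_counts[token] += 1
            for feature in context_features(tokens, i):
                feature_counts[feature][token] += 1

def predict(tokens, i):
    """Pick the label whose (add-one smoothed) feature counts score highest."""
    scores = {}
    for label in ("than", "then"):
        score = label_counts[label] + 1
        for feature in context_features(tokens, i):
            score *= feature_counts[feature][label] + 1
        scores[label] = score
    return max(scores, key=scores.get)

# The blank marks the confusable word to be filled in.
print(predict("slower ___ light".split(), 1))  # -> "than"
```

The point of the study discussed above is precisely that, given enough training data, the differences between such crude models and far more sophisticated learners largely wash out.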
As another example, consider the problem of answering short, fact-based questions
such as “Who shot Abraham Lincoln?” Instead of returning a list of documents that the
user would then have to sort through, a question answering (QA) system would directly
return the answer: John Wilkes Booth. This problem gained interest in the late 1990s,
when natural language processing researchers approached the challenge with sophisti-
cated linguistic processing techniques such as syntactic and semantic analysis. Around
2001, researchers discovered a far simpler approach to answering such questions based
on pattern matching [27, 53, 92]. Suppose you wanted the answer to the above question.
As it turns out, you can simply search for the phrase “shot Abraham Lincoln” on the
web and look for what appears to its left. Or better yet, look through multiple instances
of this phrase and tally up the words that appear to the left. This simple strategy works
surprisingly well, and has become known as the redundancy-based approach to question
answering. It capitalizes on the insight that in a very large text collection (i.e., the
web), answers to commonly-asked questions will be stated in obvious ways, such that
pattern-matching techniques suffice to extract answers accurately.
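A minimal sketch of the redundancy-based idea follows. The snippet list is a hypothetical stand-in for web search results, and the stopword filter is a simplifying assumption of ours; a real system would issue the query against a search engine or a web-scale collection:

```python
from collections import Counter
import re

# Hypothetical snippets standing in for web search results containing the
# target phrase; a real system would retrieve these from a search engine.
snippets = [
    "John Wilkes Booth shot Abraham Lincoln at Ford's Theatre in 1865.",
    "It was John Wilkes Booth who shot Abraham Lincoln.",
    "Conspiracy theories about who really shot Abraham Lincoln persist.",
    "Booth shot Abraham Lincoln during a performance of Our American Cousin.",
]

phrase = "shot Abraham Lincoln"
stopwords = {"who", "that", "he", "it", "was", "really"}  # skip function words
votes = Counter()

for snippet in snippets:
    for match in re.finditer(re.escape(phrase), snippet):
        left = snippet[:match.start()].split()
        if left and left[-1].lower() not in stopwords:
            votes[left[-1]] += 1              # single word to the left
        if len(left) >= 3:
            votes[" ".join(left[-3:])] += 1   # short left context (e.g., a full name)

# The most frequent left context is taken as the candidate answer.
print(votes.most_common(3))
```

Tallying across many redundant statements of the same fact is what makes this simple strategy robust: the correct answer appears far more often than any spurious left context.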
Yet another example concerns smoothing in web-scale language models [25]. A
language model is a probability distribution that characterizes the likelihood of observ-
ing a particular sequence of words, estimated from a large corpus of texts. They are
useful in a variety of applications, such as speech recognition (to determine what the
speaker is more likely to have said) and machine translation (to determine which of the
possible translations is the most fluent, as we will discuss in Section 6.4). Since there
are infinitely many possible strings, and probabilities must be assigned to all of them,
language modeling is a more challenging task than simply keeping track of which strings
were seen how many times: some number of likely strings will never be encountered,
even with lots and lots of training data! Most modern language models make the Markov
assumption: in an n-gram language model, the conditional probability of a word is given
by the n − 1 previous words. Thus, by the chain rule, the probability of a sequence of
words can be decomposed into the product of n-gram probabilities. Nevertheless, an
enormous number of parameters must still be estimated from a training corpus: poten-
tially V^n parameters, where V is the number of words in the vocabulary. Even if we
treat every word on the web as the training corpus from which to estimate the n-gram
probabilities, most n-grams—in any language, even English—will never have been seen.
To cope with this sparseness, researchers have developed a number of smoothing tech-
niques [35, 102, 79], which all share the basic idea of moving probability mass from
observed to unseen events in a principled manner. Smoothing approaches vary in ef-
fectiveness, both in terms of intrinsic and application-specific metrics. In 2007, Brants
et al. [25] described language models trained on up to two trillion words.^4 Their ex-
periments compared a state-of-the-art approach known as Kneser-Ney smoothing [35]
with another technique the authors affectionately referred to as “stupid backoff”.^5 Not
surprisingly, stupid backoff didn’t work as well as Kneser-Ney smoothing on smaller
corpora. However, it was simpler and could be trained on more data, which ultimately
yielded better language models. That is, a simpler technique on more data beat a more
sophisticated technique on less data.
^4 As an aside, it is interesting to observe the evolving definition of large over the years. Banko and Brill’s paper in 2001 was titled Scaling to Very Very Large Corpora for Natural Language Disambiguation, and dealt with a corpus containing a billion words.
^5 As in, so stupid it couldn’t possibly work.
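The scoring rule behind stupid backoff is simple enough to sketch in a few lines. The following toy example is our own illustration (the tiny corpus stands in for counts that would, in the setting of Brants et al., be computed over web-scale text); it returns a score rather than a normalized probability, backing off to shorter contexts with a fixed factor whenever an n-gram is unseen:

```python
from collections import Counter

ALPHA = 0.4  # backoff factor reported by Brants et al. [25]

# Toy corpus; in the setting described above, the counts would come from a
# distributed job over web-scale text rather than from a single sentence.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Collect raw n-gram counts up to trigrams.
counts = Counter()
for n in (1, 2, 3):
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i:i + n])] += 1
total_words = len(corpus)

def stupid_backoff(word, context):
    """Score (not a normalized probability) of `word` given a context tuple."""
    if not context:
        return counts[(word,)] / total_words          # unigram base case
    ngram = context + (word,)
    if counts[ngram] > 0:
        return counts[ngram] / counts[context]        # relative frequency if observed
    return ALPHA * stupid_backoff(word, context[1:])  # otherwise back off to a shorter context

print(stupid_backoff("sat", ("the", "cat")))   # trigram observed directly
print(stupid_backoff("fish", ("the", "cat")))  # backs off to the bigram, then the unigram
```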
Recently, three Google researchers summarized this data-driven philosophy in
an essay titled The Unreasonable Effectiveness of Data [65].^6 Why is this so? It boils
down to the fact that language in the wild, just like human behavior in general, is
messy. Unlike, say, the interaction of subatomic particles, human use of language is
not constrained by succinct, universal “laws of grammar”. There are of course rules
that govern the formation of words and sentences—for example, that verbs appear
before objects in English, and that subjects and verbs must agree in number in many
languages—but real-world language is affected by a multitude of other factors as well:
people invent new words and phrases all the time, authors occasionally make mistakes,
groups of individuals write within a shared context, etc. The Argentine writer Jorge
Luis Borges wrote a famous allegorical one-paragraph story about a fictional society
in which the art of cartography had gotten so advanced that their maps were as big
as the lands they were describing.^7 The world, he would say, is the best description of
itself. In the same way, the more observations we gather about language use, the more
accurate a description we have of language itself. This, in turn, translates into more
effective algorithms and systems.
^6 This title was inspired by a classic article titled The Unreasonable Effectiveness of Mathematics in the Natural Sciences [155]. This is somewhat ironic in that the original article lauded the beauty and elegance of mathematical models in capturing natural phenomena, which is the exact opposite of the data-driven approach.
^7 On Exactitude in Science [23]. A similar exchange appears in Chapter XI of Sylvie and Bruno Concluded by Lewis Carroll (1893).
So, in summary, why large data? In some ways, the first answer is similar to
the reason people climb mountains: because they’re there. But the second answer is
even more compelling. Data represent the rising tide that lifts all boats—more data
lead to better algorithms and systems for solving real-world problems. Now that we’ve
addressed the why, let’s tackle the how. Let’s start with the obvious observation: data-
intensive processing is beyond the capability of any individual machine and requires
clusters—which means that large-data problems are fundamentally about organizing
computations on dozens, hundreds, or even thousands of machines. This is exactly
what MapReduce does, and the rest of this book is about the how.
1.1 COMPUTING IN THE CLOUDS
For better or for worse, it is often difficult to untangle MapReduce and large-data
processing from the broader discourse on cloud computing. True, there is substantial
promise in this new paradigm of computing, but unwarranted hype by the media and
popular sources threatens its credibility in the long run. In some ways, cloud computing
is simply brilliant marketing. Before clouds, there were grids,^8 and before grids, there
were vector supercomputers, each having claimed to be the best thing since sliced bread.
^8 What is the difference between cloud computing and grid computing? Although both tackle the fundamental problem of how best to bring computational resources to bear on large and difficult problems, they start with different assumptions. Whereas clouds are assumed to be relatively homogeneous servers that reside in a datacenter or are distributed across a relatively small number of datacenters controlled by a single organization, grids are assumed to be a less tightly-coupled federation of heterogeneous resources under the control of distinct but cooperative organizations. As a result, grid computing tends to deal with tasks that are coarser-grained, and must deal with the practicalities of a federated environment, e.g., verifying credentials across multiple administrative domains. Grid computing has adopted a middleware-based approach for tackling many of these challenges.
So what exactly is cloud computing? This is one of those questions where ten
experts will give eleven different answers; in fact, countless papers have been written
simply to attempt to define the term (e.g., [9, 31, 149], just to name a few examples).
Here we offer up our own thoughts and attempt to explain how cloud computing relates
to MapReduce and data-intensive processing.
At the most superficial level, everything that used to be called web applications has
been rebranded to become “cloud applications”, which includes what we have previously
called “Web 2.0” sites. In fact, anything running inside a browser that gathers and stores
user-generated content now qualifies as an example of cloud computing. This includes
social-networking services such as Facebook, video-sharing sites such as YouTube, web-
based email services such as Gmail, and applications such as Google Docs. In this
context, the cloud simply refers to the servers that power these sites, and user data is
said to reside “in the cloud”. The accumulation of vast quantities of user data creates
large-data problems, many of which are suitable for MapReduce. To give two concrete
examples: a social-networking site analyzes connections in the enormous globe-spanning
graph of friendships to recommend new connections. An online email service analyzes
messages and user behavior to optimize ad selection and placement. These are all large-
data problems that have been tackled with MapReduce.^9
^9 The first example is Facebook, a well-known user of Hadoop, in exactly the manner as described [68]. The second is, of course, Google, which uses MapReduce to continuously improve existing algorithms and to devise new algorithms for ad selection and placement.
Another important facet of cloud computing is what’s more precisely known as
utility computing [129, 31]. As the name implies, the idea behind utility computing
is to treat computing resources as a metered service, like electricity or natural gas.
The idea harkens back to the days of time-sharing machines, and in truth isn’t very
different from this antiquated form of computing. Under this model, a “cloud user” can
dynamically provision any amount of computing resources from a “cloud provider” on
demand and only pay for what is consumed. In practical terms, the user is paying for
access to virtual machine instances that run a standard operating system such as Linux.
Virtualization technology (e.g., [15]) is used by the cloud provider to allocate available
physical resources and enforce isolation between multiple users that may be sharing the
same hardware. Once one or more virtual machine instances have been provisioned, the
user has full control over the resources and can use them for arbitrary computation.
Virtual machines that are no longer needed are destroyed, thereby freeing up physical
resources that can be redirected to other users. Resource consumption is measured in
some equivalent of machine-hours and users are charged in increments thereof.
Both users and providers benefit in the utility computing model. Users are freed
from upfront capital investments necessary to build datacenters and substantial recur-
ring costs in maintaining them. They also gain the important property of elasticity—as
demand for computing resources grows, for example, from an unpredicted spike in cus-
tomers, more resources can be seamlessly allocated from the cloud without an inter-
ruption in service. As demand falls, provisioned resources can be released. Prior to the
advent of utility computing, coping with unexpected spikes in demand was fraught with
challenges: under-provision and run the risk of service interruptions, or over-provision
and tie up precious capital in idle machines that are depreciating.
From the utility provider’s point of view, this business also makes sense because
large datacenters benefit from economies of scale and can be run more efficiently than
smaller datacenters. In the same way that insurance works by aggregating risk and re-
distributing it, utility providers aggregate the computing demands for a large number
of users. Although demand may fluctuate significantly for each user, overall trends in
aggregate demand should be smooth and predictable, which allows the cloud provider
to adjust capacity over time with less risk of either offering too much (resulting in in-
efficient use of capital) or too little (resulting in unsatisfied customers). In the world of
utility computing, Amazon Web Services currently leads the way and remains the dom-
inant player, but a number of other cloud providers populate a market that is becoming
increasingly crowded. Most systems are based on proprietary infrastructure, but there
is at least one, Eucalyptus [111], that is available open source. Increased competition
will benefit cloud users, but what direct relevance does this have for MapReduce? The
connection is quite simple: processing large amounts of data with MapReduce requires
access to clusters with sufficient capacity. However, not everyone with large-data prob-
lems can afford to purchase and maintain clusters. This is where utility computing
comes in: clusters of sufficient size can be provisioned only when the need arises, and
users pay only as much as is required to solve their problems. This lowers the barrier
to entry for data-intensive processing and makes MapReduce much more accessible.
A generalization of the utility computing concept is “everything as a service”,
which is itself a new take on the age-old idea of outsourcing. A cloud provider offering
customers access to virtual machine instances is said to be offering infrastructure as a
service, or IaaS for short. However, this may be too low level for many users. Enter plat-
form as a service (PaaS), which is a rebranding of what used to be called hosted services
in the “pre-cloud” era. Platform is used generically to refer to any set of well-defined
services on top of which users can build applications, deploy content, etc. This class of
services is best exemplified by Google App Engine, which provides the backend data-
store and API for anyone to build highly-scalable web applications. Google maintains
the infrastructure, freeing the user from having to backup, upgrade, patch, or otherwise
maintain basic services such as the storage layer or the programming environment. At
an even higher level, cloud providers can offer software as a service (SaaS), as exem-
plified by Salesforce, a leader in customer relationship management (CRM) software.
Other examples include outsourcing an entire organization’s email to a third party,
which is commonplace today.
What does this proliferation of services have to do with MapReduce? No doubt
that “everything as a service” is driven by desires for greater business efficiencies, but
scale and elasticity play important roles as well. The cloud allows seamless expansion of
operations without the need for careful planning and supports scales that may otherwise
be difficult or cost-prohibitive for an organization to achieve. Cloud services, just like
MapReduce, represent the search for an appropriate level of abstraction and beneficial
divisions of labor. IaaS is an abstraction over raw physical hardware—an organization
might lack the capital, expertise, or interest in running datacenters, and therefore pays
a cloud provider to do so on its behalf. The argument applies similarly to PaaS and
SaaS. In the same vein, the MapReduce programming model is a powerful abstraction
that separates the what from the how of data-intensive processing.
1.2 BIG IDEAS
Tackling large-data problems requires a distinct approach that sometimes runs counter
to traditional models of computing. In this section, we discuss a number of “big ideas”
behind MapReduce. To be fair, all of these ideas have been discussed in the computer
science literature for some time (some for decades), and MapReduce is certainly not
the first to adopt these ideas. Nevertheless, the engineers at Google deserve tremendous
credit for pulling these various threads together and demonstrating the power of these
ideas on a scale previously unheard of.
Scale “out”, not “up”. For data-intensive workloads, a large number of commodity
low-end servers (i.e., the scaling “out” approach) is preferred over a small number of
high-end servers (i.e., the scaling “up” approach). The latter approach of purchasing
symmetric multi-processing (SMP) machines with a large number of processor sockets
(dozens, even hundreds) and a large amount of shared memory (hundreds or even thou-
sands of gigabytes) is not cost effective, since the costs of such machines do not scale
linearly (i.e., a machine with twice as many processors is often significantly more than
twice as expensive). On the other hand, the low-end server market overlaps with the
high-volume desktop computing market, which has the effect of keeping prices low due
to competition, interchangeable components, and economies of scale.
Barroso and Hölzle’s recent treatise on what they dubbed “warehouse-scale com-
puters” [18] contains a thoughtful analysis of the two approaches. The Transaction
Processing Performance Council (TPC) is a neutral, non-profit organization whose mission is to
establish objective database benchmarks. Benchmark data submitted to that organiza-
tion are probably the closest one can get to a fair “apples-to-apples” comparison of cost
and performance for specific, well-defined relational processing applications. Based on
TPC-C benchmark results from late 2007, a low-end server platform is about four times
more cost efficient than a high-end shared memory platform from the same vendor. Ex-
cluding storage costs, the price/performance advantage of the low-end server increases
to about a factor of twelve.
What if we take into account the fact that communication between nodes in
a high-end SMP machine is orders of magnitude faster than communication between
nodes in a commodity network-based cluster? Since workloads today are beyond the
capability of any single machine (no matter how powerful), the comparison is more ac-
curately between a smaller cluster of high-end machines and a larger cluster of low-end
machines (network communication is unavoidable in both cases). Barroso and Hölzle
model these two approaches under workloads that demand more or less communication,
and conclude that a cluster of low-end servers approaches the performance of the equiv-
alent cluster of high-end servers—the small performance gap is insufficient to justify the
price premium of the high-end servers. For data-intensive applications, the conclusion
appears to be clear: scaling “out” is superior to scaling “up”, and therefore most existing
implementations of the MapReduce programming model are designed around clusters
of low-end commodity servers.
The capital cost of acquiring servers is, of course, only one component of the total
cost of delivering computing capacity. Operational costs are dominated by the cost of
electricity to power the servers as well as other aspects of datacenter operations that
are functionally related to power: power distribution, cooling, etc. [67, 18]. As a result,
energy efficiency has become a key issue in building warehouse-scale computers for
large-data processing. Therefore, it is important to factor in operational costs when
deploying a scale-out solution based on large numbers of commodity servers.
Datacenter efficiency is typically factored into three separate components that
can be independently measured and optimized [18]. The first component measures how
much of a building’s incoming power is actually delivered to computing equipment, and
correspondingly, how much is lost to the building’s mechanical systems (e.g., cooling,
air handling) and electrical infrastructure (e.g., power distribution inefficiencies). The
second component measures how much of a server’s incoming power is lost to the power
supply, cooling fans, etc. The third component captures how much of the power delivered
to computing components (processor, RAM, disk, etc.) is actually used to perform useful
computations.
Of the three components of datacenter efficiency, the first two are relatively
straightforward to objectively quantify. Adoption of industry best-practices can help
datacenter operators achieve state-of-the-art efficiency. The third component, however,
is much more difficult to measure. One important issue that has been identified is the
non-linearity between load and power draw. That is, a server at 10% utilization may
draw slightly more than half as much power as a server at 100% utilization (which
means that a lightly-loaded server is much less efficient than a heavily-loaded server).
A survey of five thousand Google servers over a six-month period shows that servers
operate most of the time at between 10% and 50% utilization [17], which is an energy-
inefficient operating region. As a result, Barroso and Hölzle have advocated for research
and development in energy-proportional machines, where energy consumption would
be proportional to load, such that an idle processor would (ideally) consume no power,
but yet retain the ability to power up (nearly) instantaneously in response to demand.
Although we have provided a brief overview here, datacenter efficiency is a topic
that is beyond the scope of this book. For more details, consult Barroso and Hölzle [18]
and Hamilton [67], who provide detailed cost models for typical modern datacenters.
However, even factoring in operational costs, evidence suggests that scaling out remains
more attractive than scaling up.
Assume failures are common. At warehouse scale, failures are not only inevitable,
but commonplace. A simple calculation suffices to demonstrate: let us suppose that a
cluster is built from reliable machines with a mean-time between failures (MTBF) of
1000 days (about three years). Even with these reliable servers, a 10,000-server cluster
would still experience roughly 10 failures a day. For the sake of argument, let us suppose
that an MTBF of 10,000 days (about thirty years) were achievable at realistic costs (which
is unlikely). Even then, a 10,000-server cluster would still experience one failure daily.
This means that any large-scale service that is distributed across a large cluster (either
a user-facing application or a computing platform like MapReduce) must cope with
hardware failures as an intrinsic aspect of its operation [66]. That is, a server may fail at
any time, without notice. For example, in large clusters disk failures are common [123]
and RAM experiences more errors than one might expect [135]. Datacenters suffer
from both planned outages (e.g., system maintenance and hardware upgrades) and
unexpected outages (e.g., power failure, connectivity loss, etc.).
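The arithmetic behind these failure estimates is worth making explicit; the sketch below simply assumes that expected daily failures scale as the number of servers divided by the per-server MTBF expressed in days:

```python
# Back-of-the-envelope check of the failure rates quoted above: with N servers
# that fail independently, expected failures per day is roughly N / MTBF_days.
def expected_failures_per_day(num_servers, mtbf_days):
    return num_servers / mtbf_days

print(expected_failures_per_day(10_000, 1_000))   # ~10 failures/day at a 1,000-day MTBF
print(expected_failures_per_day(10_000, 10_000))  # ~1 failure/day even at a 10,000-day MTBF
```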
A well-designed, fault-tolerant service must cope with failures up to a point with-
out impacting the quality of service—failures should not result in inconsistencies or in-
determinism from the user perspective. As servers go down, other cluster nodes should
seamlessly step in to handle the load, and overall performance should gracefully degrade
as server failures pile up. Just as important, a broken server that has been repaired
should be able to seamlessly rejoin the service without manual reconfiguration by the
administrator. Mature implementations of the MapReduce programming model are able
to robustly cope with failures through a number of mechanisms such as automatic task
restarts on different cluster nodes.
Move processing to the data. In traditional high-performance computing (HPC)
applications (e.g., for climate or nuclear simulations), it is commonplace for a supercom-
puter to have “processing nodes” and “storage nodes” linked together by a high-capacity
interconnect. Many data-intensive workloads are not very processor-demanding, which
means that the separation of compute and storage creates a bottleneck in the network.
As an alternative to moving data around, it is more efficient to move the process-
ing around. That is, MapReduce assumes an architecture where processors and storage
(disk) are co-located. In such a setup, we can take advantage of data locality by running
code on the processor directly attached to the block of data we need. The distributed
file system is responsible for managing the data over which MapReduce operates.
Process data sequentially and avoid random access. Data-intensive processing
by definition means that the relevant datasets are too large to fit in memory and must
be held on disk. Seek times for random disk access are fundamentally limited by the
mechanical nature of the devices: read heads can only move so fast and platters can only
spin so rapidly. As a result, it is desirable to avoid random data access, and instead orga-
nize computations so that data is processed sequentially. A simple scenario^10 poignantly
illustrates the large performance gap between sequential operations and random seeks:
assume a 1 terabyte database containing 10^10 100-byte records. Given reasonable as-
sumptions about disk latency and throughput, a back-of-the-envelope calculation will
show that updating 1% of the records (by accessing and then mutating each record)
will take about a month on a single machine. On the other hand, if one simply reads the
entire database and rewrites all the records (mutating those that need updating), the
process would finish in under a work day on a single machine. Sequential data access
is, literally, orders of magnitude faster than random data access.^11
^10 Adapted from a post by Ted Dunning on the Hadoop mailing list.
^11 For more detail, Jacobs [76] provides real-world benchmarks in his discussion of large-data problems.
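The back-of-the-envelope calculation can be reproduced as follows; the disk parameters (roughly 10 ms per random seek and 100 MB/s of sustained sequential throughput) are our own assumptions for a typical drive of the era, not figures given in the text:

```python
# Back-of-the-envelope comparison of random updates vs. a full sequential
# rewrite for the scenario above. Disk parameters are assumptions (roughly
# typical for a 2010-era drive), not figures from the text.
SEEK_SECONDS = 0.010              # ~10 ms per random seek
TRANSFER_BYTES_PER_SEC = 100e6    # ~100 MB/s sustained sequential throughput

NUM_RECORDS = 10**10
RECORD_BYTES = 100
DB_BYTES = NUM_RECORDS * RECORD_BYTES  # 1 terabyte

# Random access: read then write 1% of the records, one seek per operation.
updates = NUM_RECORDS // 100
random_seconds = updates * 2 * SEEK_SECONDS
print(f"random updates: ~{random_seconds / 86400:.0f} days")      # on the order of a month

# Sequential: stream the whole database in and write it back out.
sequential_seconds = 2 * DB_BYTES / TRANSFER_BYTES_PER_SEC
print(f"full rewrite:   ~{sequential_seconds / 3600:.1f} hours")  # well under a work day
```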
The development of solid-state drives is unlikely to change this balance for at
least two reasons. First, the cost differential between traditional magnetic disks and
solid-state disks remains substantial: large datasets will for the most part remain on me-
chanical drives, at least in the near future. Second, although solid-state disks have
substantially faster seek times, order-of-magnitude differences in performance between
sequential and random access still remain.
MapReduce is primarily designed for batch processing over large datasets. To the
extent possible, all computations are organized into long streaming operations that
take advantage of the aggregate bandwidth of many disks in a cluster. Many aspects of
MapReduce’s design explicitly trade latency for throughput.
Hide system-level details from the application developer. According to many
guides on the practice of software engineering written by experienced industry profes-
sionals, one of the key reasons why writing code is difficult is because the programmer
must simultaneously keep track of many details in short-term memory—ranging from
the mundane (e.g., variable names) to the sophisticated (e.g., a corner case of an algo-
rithm that requires special treatment). This imposes a high cognitive load and requires
intense concentration, which leads to a number of recommendations about a program-
mer’s environment (e.g., quiet office, comfortable furniture, large monitors, etc.). The
challenges in writing distributed software are greatly compounded—the programmer
must manage details across several threads, processes, or machines. Of course, the
biggest headache in distributed programming is that code runs concurrently in un-
predictable orders, accessing data in unpredictable patterns. This gives rise to race
conditions, deadlocks, and other well-known problems. Programmers are taught to use
low-level devices such as mutexes and to apply high-level “design patterns” such as
producer–consumer queues to tackle these challenges, but the truth remains: concur-
rent programs are notoriously difficult to reason about and even harder to debug.
MapReduce addresses the challenges of distributed programming by providing an
abstraction that isolates the developer from system-level details (e.g., locking of data
structures, data starvation issues in the processing pipeline, etc.). The programming
model specifies simple and well-defined interfaces between a small number of compo-
nents, and therefore is easy for the programmer to reason about. MapReduce maintains
a separation of what computations are to be performed and how those computations are
actually carried out on a cluster of machines. The first is under the control of the pro-
grammer, while the second is exclusively the responsibility of the execution framework
or “runtime”. The advantage is that the execution framework only needs to be de-
signed once and verified for correctness—thereafter, as long as the developer expresses
computations in the programming model, code is guaranteed to behave as expected.
The upshot is that the developer is freed from having to worry about system-level de-
tails (e.g., no more debugging race conditions and addressing lock contention) and can
instead focus on algorithm or application design.
Seamless scalability. For data-intensive processing, it goes without saying that scal-
able algorithms are highly desirable. As an aspiration, let us sketch the behavior of an
ideal algorithm. We can define scalability along at least two dimensions.^12
First, in terms
of data: given twice the amount of data, the same algorithm should take at most twice
as long to run, all else being equal. Second, in terms of resources: given a cluster twice
the size, the same algorithm should take no more than half as long to run. Furthermore,
an ideal algorithm would maintain these desirable scaling characteristics across a wide
range of settings: on data ranging from gigabytes to petabytes, on clusters consisting
of a few to a few thousand machines. Finally, the ideal algorithm would exhibit these
desired behaviors without requiring any modifications whatsoever, not even tuning of
parameters.
^12 See also DeWitt and Gray [50] for slightly different definitions in terms of speedup and scaleup.
Other than for embarrassingly parallel problems, algorithms with the character-
istics sketched above are, of course, unobtainable. One of the fundamental assertions
in Fred Brooks’s classic The Mythical Man-Month [28] is that adding programmers to a
project behind schedule will only make it fall further behind. This is because complex
tasks cannot be chopped into smaller pieces and allocated in a linear fashion, and is
often illustrated with a cute quote: “nine women cannot have a baby in one month”.
Although Brooks’s observations are primarily about software engineers and the soft-
ware development process, the same is also true of algorithms: increasing the degree
of parallelization also increases communication costs. The algorithm designer is faced
with diminishing returns, and beyond a certain point, greater efficiencies gained by
parallelization are entirely offset by increased communication requirements.
Nevertheless, these fundamental limitations shouldn’t prevent us from at least
striving for the unobtainable. The truth is that most current algorithms are far from
the ideal. In the domain of text processing, for example, most algorithms today assume
that data fits in memory on a single machine. For the most part, this is a fair assumption.
But what happens when the amount of data doubles in the near future, and then doubles
again shortly thereafter? Simply buying more memory is not a viable solution, as the
amount of data is growing faster than the price of memory is falling. Furthermore, the
price of a machine does not scale linearly with the amount of available memory beyond
a certain point (once again, the scaling “up” vs. scaling “out” argument). Quite simply,
algorithms that require holding intermediate data in memory on a single machine will
simply break on sufficiently-large datasets—moving from a single machine to a cluster
architecture requires fundamentally different algorithms (and reimplementations).
Perhaps the most exciting aspect of MapReduce is that it represents a small step
toward algorithms that behave in the ideal manner discussed above. Recall that the
programming model maintains a clear separation between what computations need to
occur and how those computations are actually orchestrated on a cluster. As a result,
a MapReduce algorithm remains fixed, and it is the responsibility of the execution
framework to execute the algorithm. Amazingly, the MapReduce programming model
is simple enough that it is actually possible, in many circumstances, to approach the
ideal scaling characteristics discussed above. We introduce the idea of the “tradeable
machine hour”, as a play on Brooks’s classic title. If running an algorithm on a particular
dataset takes 100 machine hours, then we should be able to finish in an hour on a cluster
of 100 machines, or use a cluster of 10 machines to complete the same task in ten hours.^13
With MapReduce, this isn’t so far from the truth, at least for some applications.
^13 Note that this idea meshes well with utility computing, where a 100-machine cluster running for one hour would cost the same as a 10-machine cluster running for ten hours.
1.3 WHY IS THIS DIFFERENT?
“Due to the rapidly decreasing cost of processing, memory, and communica-
tion, it has appeared inevitable for at least two decades that parallel machines
will eventually displace sequential ones in computationally intensive domains.
This, however, has not happened.” — Leslie Valiant [148]^14
^14 Guess when this was written? You may be surprised.
For several decades, computer scientists have predicted that the dawn of the age of
parallel computing was “right around the corner” and that sequential processing would
soon fade into obsolescence (consider, for example, the above quote). Yet, until very re-
cently, they have been wrong. The relentless progress of Moore’s Law for several decades
has ensured that most of the world’s problems could be solved by single-processor ma-
chines, save the needs of a few (scientists simulating molecular interactions or nuclear
reactions, for example). Couple that with the inherent challenges of concurrency, and
the result has been that parallel processing and distributed systems have largely been
confined to a small segment of the market and esoteric upper-level electives in the
computer science curriculum.
However, all of that changed around the middle of the first decade of this cen-
tury. The manner in which the semiconductor industry had been exploiting Moore’s
Law simply ran out of opportunities for improvement: faster clocks, deeper pipelines,
superscalar architectures, and other tricks of the trade reached a point of diminish-
ing returns that did not justify continued investment. This marked the beginning of
an entirely new strategy and the dawn of the multi-core era [115]. Unfortunately, this
radical shift in hardware architecture was not matched at that time by corresponding
advances in how software could be easily designed for these new processors (but not for
lack of trying [104]). Nevertheless, parallel processing became an important issue at the
forefront of everyone’s mind—it represented the only way forward.
At around the same time, we witnessed the growth of large-data problems. In the
late 1990s and even during the beginning of the first decade of this century, relatively
few organizations had data-intensive processing needs that required large clusters: a
handful of internet companies and perhaps a few dozen large corporations. But then,
everything changed. Through a combination of many different factors (falling prices of
disks, rise of user-generated web content, etc.), large-data problems began popping up
everywhere. Data-intensive processing needs became widespread, which drove innova-
tions in distributed computing such as MapReduce—first by Google, and then by Yahoo
and the open source community. This in turn created more demand: when organiza-
tions learned about the availability of effective data analysis tools for large datasets,
they began instrumenting various business processes to gather even more data—driven
by the belief that more data leads to deeper insights and greater competitive advantages.
Today, not only are large-data problems ubiquitous, but technological solutions for ad-
dressing them are widely accessible. Anyone can download the open source Hadoop
implementation of MapReduce, pay a modest fee to rent a cluster from a utility cloud
provider, and be happily processing terabytes upon terabytes of data within the week.
Finally, the computer scientists are right—the age of parallel computing has begun,
both in terms of multiple cores in a chip and multiple machines in a cluster (each of
which often has multiple cores).
Why is MapReduce important? In practical terms, it provides a very effective tool
for tackling large-data problems. But beyond that, MapReduce is important in how it
has changed the way we organize computations at a massive scale. MapReduce repre-
sents the first widely-adopted step away from the von Neumann model that has served
as the foundation of computer science for more than half a century. Valiant called this
a bridging model [148], a conceptual bridge between the physical implementation of a
machine and the software that is to be executed on that machine. Until recently, the
von Neumann model has served us well: Hardware designers focused on efficient imple-
mentations of the von Neumann model and didn’t have to think much about the actual
software that would run on the machines. Similarly, the software industry developed
software targeted at the model without worrying about the hardware details. The result
was extraordinary growth: chip designers churned out successive generations of increas-
ingly powerful processors, and software engineers were able to develop applications in
high-level languages that exploited those processors.
Today, however, the von Neumann model isn’t sufficient anymore: we can’t treat
a multi-core processor or a large cluster as an agglomeration of many von Neumann
machine instances communicating over some interconnect. Such a view places too much
burden on the software developer to effectively take advantage of available computa-
tional resources—it simply is the wrong level of abstraction. MapReduce can be viewed
as the first breakthrough in the quest for new abstractions that allow us to organize
computations, not over individual machines, but over entire clusters. As Barroso puts
it, the datacenter is the computer [18, 119].
To be fair, MapReduce is certainly not the first model of parallel computation
that has been proposed. The most prevalent model in theoretical computer science,
which dates back several decades, is the PRAM [77, 60]. In the model, an arbitrary
number of processors, sharing an unboundedly large memory, operate synchronously on
a shared input to produce some output. Other models include LogP [43] and BSP [148].
[Footnote 15] More than a theoretical model, the PRAM has been recently prototyped in hardware [153].
For reasons that are beyond the scope of this book, none of these previous models have
enjoyed the success that MapReduce has in terms of adoption and in terms of impact
on the daily lives of millions of users.
MapReduce is the most successful abstraction over large-scale computational re-
sources we have seen to date. However, as anyone who has taken an introductory
computer science course knows, abstractions manage complexity by hiding details and
presenting well-defined behaviors to users of those abstractions. They, inevitably, are
imperfect—making certain tasks easier but others more difficult, and sometimes, im-
possible (in the case where the detail suppressed by the abstraction is exactly what
the user cares about). This critique applies to MapReduce: it makes certain large-data
problems easier, but suffers from limitations as well. This means that MapReduce is
not the final word, but rather the first in a new class of programming models that will
allow us to more effectively organize computations at a massive scale.
So if MapReduce is only the beginning, what’s next beyond MapReduce? We’re
getting ahead of ourselves, as we can’t meaningfully answer this question before thor-
oughly understanding what MapReduce can and cannot do well. This is exactly the
purpose of this book: let us now begin our exploration.
1.4 WHAT THIS BOOK IS NOT
Actually, not quite yet. . . A final word before we get started. This book is about Map-
Reduce algorithm design, particularly for text processing (and related) applications.
Although our presentation most closely follows the Hadoop open-source implementa-
tion of MapReduce, this book is explicitly not about Hadoop programming. We don’t
for example, discuss APIs, command-line invocations for running jobs, etc. For those
aspects, we refer the reader to Tom White’s excellent book, “Hadoop: The Definitive
Guide”, published by O’Reilly [154].
[Footnote 16] Nevertheless, it is important to understand the relationship between MapReduce and existing models so that we can bring to bear accumulated knowledge about parallel algorithms; for example, Karloff et al. [82] demonstrated that a large class of PRAM algorithms can be efficiently simulated via MapReduce.
C H A P T E R 2
MapReduce Basics
The only feasible approach to tackling large-data problems today is to divide and con-
quer, a fundamental concept in computer science that is introduced very early in typical
undergraduate curricula. The basic idea is to partition a large problem into smaller sub-
problems. To the extent that the sub-problems are independent [5], they can be tackled
in parallel by different workers—threads in a processor core, cores in a multi-core pro-
cessor, multiple processors in a machine, or many machines in a cluster. Intermediate
results from each individual worker are then combined to yield the final output.
The general principles behind divide-and-conquer algorithms are broadly applica-
ble to a wide range of problems in many different application domains. However, the
details of their implementations are varied and complex. For example, the following are
just some of the issues that need to be addressed:
• How do we break up a large problem into smaller tasks? More specifically, how do
we decompose the problem so that the smaller tasks can be executed in parallel?
• How do we assign tasks to workers distributed across a potentially large number
of machines (while keeping in mind that some workers are better suited to running
some tasks than others, e.g., due to available resources, locality constraints, etc.)?
• How do we ensure that the workers get the data they need?
• How do we coordinate synchronization among the different workers?
• How do we share partial results produced by one worker that are needed by another?
• How do we accomplish all of the above in the face of software errors and hardware
faults?
In traditional parallel or distributed programming environments, the developer
needs to explicitly address many (and sometimes, all) of the above issues. In shared
memory programming, the developer needs to explicitly coordinate access to shared
data structures through synchronization primitives such as mutexes, to explicitly han-
dle process synchronization through devices such as barriers, and to remain ever vigilant
for common problems such as deadlocks and race conditions.
[Footnote 1] We note that promising technologies such as quantum or biological computing could potentially induce a paradigm shift, but they are far from being sufficiently mature to solve real world problems.
Language extensions, like OpenMP for shared memory parallelism, or libraries implementing the Message Passing Interface (MPI) for cluster-level parallelism, provide logical abstractions that hide
details of operating system synchronization and communications primitives. However,
even with these extensions, developers are still burdened with keeping track of how resources
are made available to workers. Additionally, these frameworks are mostly designed to
tackle processor-intensive problems and have only rudimentary support for dealing with
very large amounts of input data. When using existing parallel computing approaches
for large-data computation, the programmer must devote a significant amount of at-
tention to low-level system details, which detracts from higher-level problem solving.
One of the most significant advantages of MapReduce is that it provides an ab-
straction that hides many system-level details from the programmer. Therefore, a devel-
oper can focus on what computations need to be performed, as opposed to how those
computations are actually carried out or how to get the data to the processes that
depend on them. Like OpenMP and MPI, MapReduce provides a means to distribute
computation without burdening the programmer with the details of distributed com-
puting (but at a different level of granularity). However, organizing and coordinating
large amounts of computation is only part of the challenge. Large-data processing by
definition requires bringing data and code together for computation to occur—no small
feat for datasets that are terabytes and perhaps petabytes in size! MapReduce addresses
this challenge by providing a simple abstraction for the developer, transparently han-
dling most of the details behind the scenes in a scalable, robust, and efficient manner.
As we mentioned in Chapter 1, instead of moving large amounts of data around, it is far
more efficient, if possible, to move the code to the data. This is operationally realized
by spreading data across the local disks of nodes in a cluster and running processes
on nodes that hold the data. The complex task of managing storage in such a process-
ing environment is typically handled by a distributed file system that sits underneath
MapReduce.
This chapter introduces the MapReduce programming model and the underlying
distributed file system. We start in Section 2.1 with an overview of functional program-
ming, from which MapReduce draws its inspiration. Section 2.2 introduces the basic
programming model, focusing on mappers and reducers. Section 2.3 discusses the role
of the execution framework in actually running MapReduce programs (called jobs).
Section 2.4 fills in additional details by introducing partitioners and combiners, which
provide greater control over data flow. MapReduce would not be practical without a
tightly-integrated distributed file system that manages the data being processed; Sec-
tion 2.5 covers this in detail. Tying everything together, a complete cluster architecture
is described in Section 2.6 before the chapter ends with a summary.
2
http://www.openmp.org/
3
http://www.mcs.anl.gov/mpi/
Figure 2.1: Illustration of map and fold, two higher-order functions commonly used together in functional programming: map takes a function f and applies it to every element in a list, while fold iteratively applies a function g to aggregate results.
2.1 FUNCTIONAL PROGRAMMING ROOTS
MapReduce has its roots in functional programming, which is exemplified in languages
such as Lisp and ML.
A key feature of functional languages is the concept of higher-
order functions, or functions that can accept other functions as arguments. Two common
built-in higher order functions are map and fold, illustrated in Figure 2.1. Given a list,
map takes as an argument a function f (that takes a single argument) and applies it to
all elements in a list (the top part of the diagram). Given a list, fold takes as arguments
a function g (that takes two arguments) and an initial value: g is first applied to the
initial value and the first item in the list, the result of which is stored in an intermediate
variable. This intermediate variable and the next item in the list serve as the arguments
to a second application of g, the results of which are stored in the intermediate variable.
This process repeats until all items in the list have been consumed; fold then returns the
final value of the intermediate variable. Typically, map and fold are used in combination.
For example, to compute the sum of squares of a list of integers, one could map a function
that squares its argument (i.e., λx.x²) over the input list, and then fold the resulting list
with the addition function (more precisely, λxλy.x + y) using an initial value of zero.
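To make this concrete, the following plain-Java sketch (using the standard java.util.stream library, and making no claims about any MapReduce API) expresses the sum-of-squares example: map corresponds to Stream.map and fold to Stream.reduce with an initial value and a binary aggregation function.

    import java.util.Arrays;
    import java.util.List;

    public class SumOfSquares {
        public static void main(String[] args) {
            List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);
            int sumOfSquares = input.stream()
                .map(x -> x * x)           // map: apply f = λx.x² to every element
                .reduce(0, Integer::sum);  // fold: aggregate with g = λxλy.x + y, initial value 0
            System.out.println(sumOfSquares);  // prints 55
        }
    }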
We can view map as a concise way to represent the transformation of a dataset
(as defined by the function f). In the same vein, we can view fold as an aggregation
operation, as defined by the function g. One immediate observation is that the appli-
cation of f to each item in a list (or more generally, to elements in a large dataset)
can be parallelized in a straightforward manner, since each functional application happens in isolation.
[Footnote 4] However, there are important characteristics of MapReduce that make it non-functional in nature—this will become apparent later.
In a cluster, these operations can be distributed across many differ-
ent machines. The fold operation, on the other hand, has more restrictions on data
locality—elements in the list must be “brought together” before the function g can be
applied. However, many real-world applications do not require g to be applied to all
elements of the list. To the extent that elements in the list can be divided into groups,
the fold aggregations can also proceed in parallel. Furthermore, for operations that are
commutative and associative, significant efficiencies can be gained in the fold operation
through local aggregation and appropriate reordering.
In a nutshell, we have described MapReduce. The map phase in MapReduce
roughly corresponds to the map operation in functional programming, whereas the
reduce phase in MapReduce roughly corresponds to the fold operation in functional
programming. As we will discuss in detail shortly, the MapReduce execution framework
coordinates the map and reduce phases of processing over large amounts of data on
large clusters of commodity machines.
Viewed from a slightly different angle, MapReduce codifies a generic “recipe” for
processing large datasets that consists of two stages. In the first stage, a user-specified
computation is applied over all input records in a dataset. These operations occur in
parallel and yield intermediate output that is then aggregated by another user-specified
computation. The programmer defines these two types of computations, and the exe-
cution framework coordinates the actual processing (very loosely, MapReduce provides
a functional abstraction). Although such a two-stage processing structure may appear
to be very restrictive, many interesting algorithms can be expressed quite concisely—
especially if one decomposes complex algorithms into a sequence of MapReduce jobs.
Subsequent chapters in this book focus on how a number of algorithms can be imple-
mented in MapReduce.
To be precise, MapReduce can refer to three distinct but related concepts. First,
MapReduce is a programming model, which is the sense discussed above. Second, Map-
Reduce can refer to the execution framework (i.e., the “runtime”) that coordinates the
execution of programs written in this particular style. Finally, MapReduce can refer to
the software implementation of the programming model and the execution framework:
for example, Google’s proprietary implementation vs. the open-source Hadoop imple-
mentation in Java. And in fact, there are many implementations of MapReduce, e.g.,
targeted specifically for multi-core processors [127], for GPGPUs [71], for the CELL ar-
chitecture [126], etc. There are some differences between the MapReduce programming
model implemented in Hadoop and Google’s proprietary implementation, which we will
explicitly discuss throughout the book. However, we take a rather Hadoop-centric view
of MapReduce, since Hadoop remains the most mature and accessible implementation
to date, and therefore the one most developers are likely to use.
2.2 MAPPERS AND REDUCERS
Key-value pairs form the basic data structure in MapReduce. Keys and values may be
primitives such as integers, floating point values, strings, and raw bytes, or they may
be arbitrarily complex structures (lists, tuples, associative arrays, etc.). Programmers
typically need to define their own custom data types, although a number of libraries
such as Protocol Buffers, Thrift, and Avro simplify the task.
Part of the design of MapReduce algorithms involves imposing the key-value struc-
ture on arbitrary datasets. For a collection of web pages, keys may be URLs and values
may be the actual HTML content. For a graph, keys may represent node ids and values
may contain the adjacency lists of those nodes (see Chapter 5 for more details). In some
algorithms, input keys are not particularly meaningful and are simply ignored during
processing, while in other cases input keys are used to uniquely identify a datum (such
as a record id). In Chapter 3, we discuss the role of complex keys and values in the
design of various algorithms.
In MapReduce, the programmer defines a mapper and a reducer with the following
signatures:
map: (k₁, v₁) → [(k₂, v₂)]
reduce: (k₂, [v₂]) → [(k₃, v₃)]
The convention [. . .] is used throughout this book to denote a list. The input to a
MapReduce job starts as data stored on the underlying distributed file system (see Sec-
tion 2.5). The mapper is applied to every input key-value pair (split across an arbitrary
number of files) to generate an arbitrary number of intermediate key-value pairs. The
reducer is applied to all values associated with the same intermediate key to generate
output key-value pairs.
Implicit between the map and reduce phases is a distributed
“group by” operation on intermediate keys. Intermediate data arrive at each reducer
in order, sorted by the key. However, no ordering relationship is guaranteed for keys
across different reducers. Output key-value pairs from each reducer are written persis-
tently back onto the distributed file system (whereas intermediate key-value pairs are
transient and not preserved). The output ends up in r files on the distributed file system,
where r is the number of reducers. For the most part, there is no need to consolidate
reducer output, since the r files often serve as input to yet another MapReduce job.
Figure 2.2 illustrates this two-stage processing structure.
[Footnote 5] http://code.google.com/p/protobuf/
[Footnote 6] http://incubator.apache.org/thrift/
[Footnote 7] http://hadoop.apache.org/avro/
[Footnote 8] This characterization, while conceptually accurate, is a slight simplification. See Section 2.6 for more details.
Figure 2.2: Simplified view of MapReduce. Mappers are applied to all input key-value pairs, which generate an arbitrary number of intermediate key-value pairs. Reducers are applied to all values associated with the same key. Between the map and reduce phases lies a barrier that involves a large distributed sort and group by.
1: class Mapper
2:   method Map(docid a, doc d)
3:     for all term t ∈ doc d do
4:       Emit(term t, count 1)
1: class Reducer
2:   method Reduce(term t, counts [c₁, c₂, . . .])
3:     sum ← 0
4:     for all count c ∈ counts [c₁, c₂, . . .] do
5:       sum ← sum + c
6:     Emit(term t, count sum)
Figure 2.3: Pseudo-code for the word count algorithm in MapReduce. The mapper emits an intermediate key-value pair for each word in a document. The reducer sums up all counts for each word.
A simple word count algorithm in MapReduce is shown in Figure 2.3. This algorithm counts the number of occurrences of every word in a text collection, which may be the first step in, for example, building a unigram language model (i.e., probability
distribution over words in a collection). Input key-value pairs take the form of (docid,
doc) pairs stored on the distributed file system, where the former is a unique identifier
for the document, and the latter is the text of the document itself. The mapper takes
an input key-value pair, tokenizes the document, and emits an intermediate key-value
pair for every word: the word itself serves as the key, and the integer one serves as the
value (denoting that we’ve seen the word once). The MapReduce execution framework
guarantees that all values associated with the same key are brought together in the
reducer. Therefore, in our word count algorithm, we simply need to sum up all counts
(ones) associated with each word. The reducer does exactly this, and emits final key-
value pairs with the word as the key, and the count as the value. Final output is written
to the distributed file system, one file per reducer. Words within each file will be sorted
by alphabetical order, and each file will contain roughly the same number of words. The
partitioner, which we discuss later in Section 2.4, controls the assignment of words to
reducers. The output can be examined by the programmer or used as input to another
MapReduce program.
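To make the data flow concrete without committing to any particular MapReduce API, the following plain-Java sketch simulates the word count job of Figure 2.3 entirely in memory: the map phase emits (term, 1) pairs, an in-memory group-by stands in for the distributed shuffle and sort, and the reduce phase sums the counts. The whitespace tokenization and the tiny two-document collection are assumptions made purely for illustration.

    import java.util.*;

    public class WordCountSimulation {
        public static void main(String[] args) {
            // Input: (docid, doc) pairs, normally stored on the distributed file system.
            Map<String, String> docs = new LinkedHashMap<>();
            docs.put("doc1", "a b b c");
            docs.put("doc2", "b c c c");

            // Map phase: emit an intermediate (term, 1) pair for every token in every document.
            List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
            for (String doc : docs.values()) {
                for (String term : doc.split("\\s+")) {
                    intermediate.add(new AbstractMap.SimpleEntry<>(term, 1));
                }
            }

            // Shuffle and sort: group values by key; a TreeMap mimics the sorted key order
            // that the execution framework guarantees within each reducer.
            SortedMap<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> pair : intermediate) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }

            // Reduce phase: sum all counts associated with each term and emit (term, sum).
            for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
                int sum = 0;
                for (int c : entry.getValue()) sum += c;
                System.out.println(entry.getKey() + "\t" + sum);  // a 1, b 3, c 4
            }
        }
    }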
There are some differences between the Hadoop implementation of MapReduce
and Google’s implementation.
In Hadoop, the reducer is presented with a key and an
iterator over all values associated with the particular key. The values are arbitrarily
ordered. Google’s implementation allows the programmer to specify a secondary sort
key for ordering the values (if desired)—in which case values associated with each key
would be presented to the developer’s reduce code in sorted order. Later in Section 3.4
we discuss how to overcome this limitation in Hadoop to perform secondary sorting.
Another difference: in Google’s implementation the programmer is not allowed to change
the key in the reducer. That is, the reducer output key must be exactly the same as the
reducer input key. In Hadoop, there is no such restriction, and the reducer can emit an
arbitrary number of output key-value pairs (with different keys).
To provide a bit more implementation detail: pseudo-code provided in this book
roughly mirrors how MapReduce programs are written in Hadoop. Mappers and reduc-
ers are objects that implement the Map and Reduce methods, respectively. In Hadoop,
a mapper object is initialized for each map task (associated with a particular sequence
of key-value pairs called an input split) and the Map method is called on each key-value
pair by the execution framework. In configuring a MapReduce job, the programmer pro-
vides a hint on the number of map tasks to run, but the execution framework (see next
section) makes the final determination based on the physical layout of the data (more
details in Section 2.5 and Section 2.6). The situation is similar for the reduce phase:
a reducer object is initialized for each reduce task, and the Reduce method is called
once per intermediate key. In contrast with the number of map tasks, the programmer
can precisely specify the number of reduce tasks. We will return to discuss the details
of Hadoop job execution in Section 2.6, which is dependent on an understanding of the distributed file system (covered in Section 2.5).
[Footnote 9] Personal communication, Jeff Dean.
To reiterate: although the presen-
tation of algorithms in this book closely mirrors the way they would be implemented
in Hadoop, our focus is on algorithm design and conceptual understanding—not actual
Hadoop programming. For that, we would recommend Tom White’s book [154].
What are the restrictions on mappers and reducers? Mappers and reducers can
express arbitrary computations over their inputs. However, one must generally be careful
about use of external resources since multiple mappers or reducers may be contending
for those resources. For example, it may be unwise for a mapper to query an external
SQL database, since that would introduce a scalability bottleneck on the number of map
tasks that could be run in parallel (since they might all be simultaneously querying the
database).
In general, mappers can emit an arbitrary number of intermediate key-value
pairs, and they need not be of the same type as the input key-value pairs. Similarly,
reducers can emit an arbitrary number of final key-value pairs, and they can differ
in type from the intermediate key-value pairs. Although not permitted in functional
programming, mappers and reducers can have side effects. This is a powerful and useful
feature: for example, preserving state across multiple inputs is central to the design of
many MapReduce algorithms (see Chapter 3). Such algorithms can be understood as
having side effects that only change state that is internal to the mapper or reducer.
While the correctness of such algorithms may be more difficult to guarantee (since the
function’s behavior depends not only on the current input but on previous inputs),
most potential synchronization problems are avoided since internal state is private only
to individual mappers and reducers. In other cases (see Section 4.4 and Section 6.5), it
may be useful for mappers or reducers to have external side effects, such as writing files
to the distributed file system. Since many mappers and reducers are run in parallel, and
the distributed file system is a shared global resource, special care must be taken to
ensure that such operations avoid synchronization conflicts. One strategy is to write a
temporary file that is renamed upon successful completion of the mapper or reducer [45].
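As a minimal sketch of this strategy using the local file system via java.nio.file (an HDFS implementation would go through the distributed file system's own client API, and the file names here are purely illustrative): the task writes its output under a temporary name and renames it only after completing successfully, so partial output from failed or duplicate task attempts is never exposed.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;

    public class SideEffectOutput {
        public static void main(String[] args) throws IOException {
            String taskId = "task-00003";  // hypothetical task identifier
            Path tmp = Paths.get("part-" + taskId + ".tmp");
            Path dst = Paths.get("part-" + taskId);

            // Write results to a temporary file while the task is still running.
            Files.write(tmp, "partial results...\n".getBytes(StandardCharsets.UTF_8));

            // Rename to the final name only on successful completion; on file systems
            // that support it, the atomic move makes the output appear all at once.
            Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
        }
    }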
In addition to the “canonical” MapReduce processing flow, other variations are
also possible. MapReduce programs can contain no reducers, in which case mapper
output is directly written to disk (one file per mapper). For embarrassingly parallel
problems, e.g., parsing a large text collection or independently analyzing a large number of
images, this would be a common pattern. The converse—a MapReduce program with
no mappers—is not possible, although in some cases it is useful for the mapper to imple-
ment the identity function and simply pass input key-value pairs to the reducers. This
has the effect of sorting and regrouping the input for reduce-side processing. Similarly,
in some cases it is useful for the reducer to implement the identity function, in which
case the program simply sorts and groups mapper output. Finally, running identity
mappers and reducers has the effect of regrouping and resorting the input data (which is sometimes useful).
[Footnote 10] Unless, of course, the database itself is highly scalable.
Although in the most common case, input to a MapReduce job comes from data
stored on the distributed file system and output is written back to the distributed file
system, any other system that satisfies the proper abstractions can serve as a data source
or sink. With Google’s MapReduce implementation, BigTable [34], a sparse, distributed,
persistent multidimensional sorted map, is frequently used as a source of input and as
a store of MapReduce output. HBase is an open-source BigTable clone and has similar
capabilities. Also, Hadoop has been integrated with existing MPP (massively parallel
processing) relational databases, which allows a programmer to write MapReduce jobs
over database rows and dump output into a new database table. Finally, in some cases
MapReduce jobs may not consume any input at all (e.g., computing π) or may only
consume a small amount of data (e.g., input parameters to many instances of processor-
intensive simulations running in parallel).
2.3 THE EXECUTION FRAMEWORK
One of the most important ideas behind MapReduce is separating the what of distributed
processing from the how. A MapReduce program, referred to as a job, consists of code
for mappers and reducers (as well as combiners and partitioners to be discussed in the
next section) packaged together with configuration parameters (such as where the in-
put lies and where the output should be stored). The developer submits the job to the
submission node of a cluster (in Hadoop, this is called the jobtracker) and the execution
framework (sometimes called the “runtime”) takes care of everything else: it transpar-
ently handles all other aspects of distributed code execution, on clusters ranging from
a single node to a few thousand nodes. Specific responsibilities include:
Scheduling. Each MapReduce job is divided into smaller units called tasks (see Sec-
tion 2.6 for more details). For example, a map task may be responsible for processing
a certain block of input key-value pairs (called an input split in Hadoop); similarly, a
reduce task may handle a portion of the intermediate key space. It is not uncommon
for MapReduce jobs to have thousands of individual tasks that need to be assigned to
nodes in the cluster. In large jobs, the total number of tasks may exceed the number of
tasks that can be run on the cluster concurrently, making it necessary for the scheduler
to maintain some sort of a task queue and to track the progress of running tasks so
that waiting tasks can be assigned to nodes as they become available. Another aspect
of scheduling involves coordination among tasks belonging to different jobs (e.g., from
different users). How can a large, shared resource support several users simultaneously
in a predictable, transparent, policy-driven fashion? There has been some recent work
along these lines in the context of Hadoop [131, 160].
Speculative execution is an optimization that is implemented by both Hadoop and
Google’s MapReduce implementation (called “backup tasks” [45]). Due to the barrier
between the map and reduce tasks, the map phase of a job is only as fast as the slowest
map task. Similarly, the completion time of a job is bounded by the running time of the
slowest reduce task. As a result, the speed of a MapReduce job is sensitive to what are
known as stragglers, or tasks that take an unusually long time to complete. One cause of
stragglers is flaky hardware: for example, a machine that is suffering from recoverable
errors may become significantly slower. With speculative execution, an identical copy
of the same task is executed on a different machine, and the framework simply uses the
result of the first task attempt to finish. Zaharia et al. [161] presented different execution
strategies in a recent paper, and Google has reported that speculative execution can
improve job running times by 44% [45]. Although in Hadoop both map and reduce tasks
can be speculatively executed, the common wisdom is that the technique is more helpful
for map tasks than reduce tasks, since each copy of the reduce task needs to pull data
over the network. Note, however, that speculative execution cannot adequately address
another common cause of stragglers: skew in the distribution of values associated with
intermediate keys (leading to reduce stragglers). In text processing we often observe
Zipfian distributions, which means that the task or tasks responsible for processing the
most frequent few elements will run much longer than the typical task. Better local
aggregation, discussed in the next chapter, is one possible solution to this problem.
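The idea can be sketched with a deliberately simplified progress model (this is not Hadoop's actual scheduler code; the threshold and progress values are invented): a task whose reported progress lags far behind the average of its peers becomes a candidate for a speculative backup copy.

    import java.util.*;

    public class SpeculationSketch {
        /** Returns ids of tasks whose progress (0.0 to 1.0) lags the mean by more than a threshold. */
        static List<String> speculationCandidates(Map<String, Double> progress, double threshold) {
            double mean = progress.values().stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            List<String> candidates = new ArrayList<>();
            for (Map.Entry<String, Double> e : progress.entrySet()) {
                if (mean - e.getValue() > threshold) {
                    candidates.add(e.getKey());  // launch a backup copy of this task on another node
                }
            }
            return candidates;
        }

        public static void main(String[] args) {
            Map<String, Double> progress = new LinkedHashMap<>();
            progress.put("map-0001", 0.95);
            progress.put("map-0002", 0.90);
            progress.put("map-0003", 0.30);  // straggler, perhaps running on flaky hardware
            System.out.println(speculationCandidates(progress, 0.2));  // [map-0003]
        }
    }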
Data/code co-location. The phrase data distribution is misleading, since one of the
key ideas behind MapReduce is to move the code, not the data. However, the more
general point remains—in order for computation to occur, we need to somehow feed
data to the code. In MapReduce, this issue is inextricably intertwined with scheduling
and relies heavily on the design of the underlying distributed file system.
To achieve
data locality, the scheduler starts tasks on the node that holds a particular block of data
(i.e., on its local drive) needed by the task. This has the effect of moving code to the
data. If this is not possible (e.g., a node is already running too many tasks), new tasks
will be started elsewhere, and the necessary data will be streamed over the network.
An important optimization here is to prefer nodes that are on the same rack in the
datacenter as the node holding the relevant data block, since inter-rack bandwidth is
significantly less than intra-rack bandwidth.
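A toy version of this preference order (node-local, then rack-local, then any idle node) might look like the following sketch; the node and rack names are invented, and real schedulers must also weigh load, fairness, and task-slot availability.

    import java.util.*;

    public class LocalitySketch {
        /** Picks a node for a map task: prefer a node holding a replica, then its rack, then any idle node. */
        static String assign(Set<String> idleNodes, List<String> replicaNodes, Map<String, String> rackOf) {
            // 1. Node-local: an idle node that already holds a replica of the input block.
            for (String node : replicaNodes) {
                if (idleNodes.contains(node)) return node;
            }
            // 2. Rack-local: an idle node on the same rack as some replica.
            Set<String> replicaRacks = new HashSet<>();
            for (String node : replicaNodes) replicaRacks.add(rackOf.get(node));
            for (String node : idleNodes) {
                if (replicaRacks.contains(rackOf.get(node))) return node;
            }
            // 3. Otherwise: any idle node; the data will be streamed over the network.
            return idleNodes.iterator().next();
        }

        public static void main(String[] args) {
            Map<String, String> rackOf = Map.of("n1", "rackA", "n2", "rackA", "n3", "rackB");
            Set<String> idle = new LinkedHashSet<>(List.of("n2", "n3"));
            List<String> replicas = List.of("n1");               // block stored only on n1, which is busy
            System.out.println(assign(idle, replicas, rackOf));  // n2: same rack as the replica
        }
    }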
Synchronization. In general, synchronization refers to the mechanisms by which
multiple concurrently running processes “join up”, for example, to share intermediate
results or otherwise exchange state information. In MapReduce, synchronization is ac-
complished by a barrier between the map and reduce phases of processing. Intermediate
key-value pairs must be grouped by key, which is accomplished by a large distributed
sort involving all the nodes that executed map tasks and all the nodes that will execute reduce tasks.
[Footnote 11] In the canonical case, that is. Recall that MapReduce may receive its input from other sources.
This necessarily involves copying intermediate data over the network, and
therefore the process is commonly known as “shuffle and sort”. A MapReduce job with
m mappers and r reducers involves up to m × r distinct copy operations, since each
mapper may have intermediate output going to every reducer.
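For example, a job with m = 1,000 mappers and r = 50 reducers may generate up to 1,000 × 50 = 50,000 distinct copy operations during the shuffle and sort.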
Note that the reduce computation cannot start until all the mappers have fin-
ished emitting key-value pairs and all intermediate key-value pairs have been shuffled
and sorted, since the execution framework cannot otherwise guarantee that all values
associated with the same key have been gathered. This is an important departure from
functional programming: in a fold operation, the aggregation function g is a function of
the intermediate value and the next item in the list—which means that values can be
lazily generated and aggregation can begin as soon as values are available. In contrast,
the reducer in MapReduce receives all values associated with the same key at once.
However, it is possible to start copying intermediate key-value pairs over the network
to the nodes running the reducers as soon as each mapper finishes—this is a common
optimization and implemented in Hadoop.
Error and fault handling. The MapReduce execution framework must accomplish
all the tasks above in an environment where errors and faults are the norm, not the
exception. Since MapReduce was explicitly designed around low-end commodity servers,
the runtime must be especially resilient. In large clusters, disk failures are common [123]
and RAM experiences more errors than one might expect [135]. Datacenters suffer
from both planned outages (e.g., system maintenance and hardware upgrades) and
unexpected outages (e.g., power failure, connectivity loss, etc.).
And that’s just hardware. No software is bug free—exceptions must be appropri-
ately trapped, logged, and recovered from. Large-data problems have a penchant for
uncovering obscure corner cases in code that is otherwise thought to be bug-free. Fur-
thermore, any sufficiently large dataset will contain corrupted data or records that are
mangled beyond a programmer’s imagination—resulting in errors that one would never
think to check for or trap. The MapReduce execution framework must thrive in this
hostile environment.
2.4 PARTITIONERS AND COMBINERS
We have thus far presented a simplified view of MapReduce. There are two additional
elements that complete the programming model: partitioners and combiners.
Partitioners are responsible for dividing up the intermediate key space and assign-
ing intermediate key-value pairs to reducers. In other words, the partitioner specifies
the task to which an intermediate key-value pair must be copied. Within each reducer,
keys are processed in sorted order (which is how the “group by” is implemented). The
simplest partitioner involves computing the hash value of the key and then taking the
mod of that value with the number of reducers. This assigns approximately the same
number of keys to each reducer (dependent on the quality of the hash function). Note,
however, that the partitioner only considers the key and ignores the value—therefore, a
roughly-even partitioning of the key space may nevertheless yield large differences in the
number of key-values pairs sent to each reducer (since different keys may have different
numbers of associated values). This imbalance in the amount of data associated with
each key is relatively common in many text processing applications due to the Zipfian
distribution of word occurrences.
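The hash-and-mod strategy fits in a single line of Java; the sketch below is illustrative rather than any particular framework's partitioner class, and the bitmask simply guards against negative hash codes.

    public class HashPartitionerSketch {
        /** Assigns a key to one of numReducers partitions. */
        static int partition(Object key, int numReducers) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }

        public static void main(String[] args) {
            // Every occurrence of the same key maps to the same reducer, regardless of its value.
            System.out.println(partition("the", 4));
            System.out.println(partition("cloud", 4));
        }
    }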
Combiners are an optimization in MapReduce that allow for local aggregation
before the shuffle and sort phase. We can motivate the need for combiners by considering
the word count algorithm in Figure 2.3, which emits a key-value pair for each word
in the collection. Furthermore, all these key-value pairs need to be copied across the
network, and so the amount of intermediate data will be larger than the input collection
itself. This is clearly inefficient. One solution is to perform local aggregation on the
output of each mapper, i.e., to compute a local count for a word over all the documents
processed by the mapper. With this modification (assuming the maximum amount of
local aggregation possible), the number of intermediate key-value pairs will be at most
the number of unique words in the collection times the number of mappers (and typically
far smaller because each mapper may not encounter every word).
The combiner in MapReduce supports such an optimization. One can think of
combiners as “mini-reducers” that take place on the output of the mappers, prior to the
shuffle and sort phase. Each combiner operates in isolation and therefore does not have
access to intermediate output from other mappers. The combiner is provided keys and
values associated with each key (the same types as the mapper output keys and values).
Critically, one cannot assume that a combiner will have the opportunity to process all
values associated with the same key. The combiner can emit any number of key-value
pairs, but the keys and values must be of the same type as the mapper output (same as
the reducer input).
In cases where an operation is both associative and commutative
(e.g., addition or multiplication), reducers can directly serve as combiners. In general,
however, reducers and combiners are not interchangeable.
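For word count, a combiner simply pre-sums the counts emitted by a single mapper; because addition is commutative and associative, it produces (term, partial count) pairs of exactly the same type as the mapper output. The sketch below operates on an in-memory list standing in for one mapper's output; the sample pairs are illustrative.

    import java.util.*;

    public class CombinerSketch {
        /** Combines (term, count) pairs produced by one mapper into partial sums. */
        static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
            Map<String, Integer> partialSums = new HashMap<>();
            for (Map.Entry<String, Integer> pair : mapperOutput) {
                partialSums.merge(pair.getKey(), pair.getValue(), Integer::sum);
            }
            return partialSums;  // emitted as (term, partial count) pairs, same types as the mapper output
        }

        public static void main(String[] args) {
            List<Map.Entry<String, Integer>> mapperOutput = List.of(
                Map.entry("the", 1), Map.entry("the", 1), Map.entry("cloud", 1), Map.entry("the", 1));
            // Two pairs (the -> 3, cloud -> 1) cross the network instead of four.
            System.out.println(combine(mapperOutput));
        }
    }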
In many cases, proper use of combiners can spell the difference between an imprac-
tical algorithm and an efficient algorithm. This topic will be discussed in Section 3.1,
which focuses on various techniques for local aggregation. It suffices to say for now that
a combiner can significantly reduce the amount of data that needs to be copied over the network, resulting in much faster algorithms.
[Footnote 12] A note on the implementation of combiners in Hadoop: by default, the execution framework reserves the right to use combiners at its discretion. In reality, this means that a combiner may be invoked zero, one, or multiple times. In addition, combiners in Hadoop may actually be invoked in the reduce phase, i.e., after key-value pairs have been copied over to the reducer, but before the user reducer code runs. As a result, combiners must be carefully written so that they can be executed in these different environments. Section 3.1.2 discusses this in more detail.
Figure 2.4: Complete view of MapReduce, illustrating combiners and partitioners in addition to mappers and reducers. Combiners can be viewed as “mini-reducers” in the map phase. Partitioners determine which reducer is responsible for a particular key.
The complete MapReduce model is shown in Figure 2.4. Output of the mappers
are processed by the combiners, which perform local aggregation to cut down on the
number of intermediate key-value pairs. The partitioner determines which reducer will
be responsible for processing a particular key, and the execution framework uses this
information to copy the data to the right location during the shuffle and sort phase.
Therefore, a complete MapReduce job consists of code for the mapper, reducer, com-
biner, and partitioner, along with job configuration parameters. The execution frame-
work handles everything else.
[Footnote 13] In Hadoop, partitioners are actually executed before combiners, so while Figure 2.4 is conceptually accurate, it doesn't precisely describe the Hadoop implementation.
2.5 THE DISTRIBUTED FILE SYSTEM
So far, we have mostly focused on the processing aspect of data-intensive processing,
but it is important to recognize that without data, there is nothing to compute on. In
high-performance computing (HPC) and many traditional cluster architectures, stor-
age is viewed as a distinct and separate component from computation. Implementations
vary widely, but network-attached storage (NAS) and storage area networks (SAN) are
common; supercomputers often have dedicated subsystems for handling storage (sepa-
rate nodes, and often even separate networks). Regardless of the details, the processing
cycle remains the same at a high level: the compute nodes fetch input from storage, load
the data into memory, process the data, and then write back the results (with perhaps
intermediate checkpointing for long-running processes).
As dataset sizes increase, more compute capacity is required for processing. But as
compute capacity grows, the link between the compute nodes and the storage becomes
a bottleneck. At that point, one could invest in higher performance but more expensive
networks (e.g., 10 gigabit Ethernet) or special-purpose interconnects such as InfiniBand
(even more expensive). In most cases, this is not a cost-effective solution, as the price
of networking equipment increases non-linearly with performance (e.g., a switch with
ten times the capacity is usually more than ten times more expensive). Alternatively,
one could abandon the separation of computation and storage as distinct components
in a cluster. The distributed file system (DFS) that underlies MapReduce adopts ex-
actly this approach. The Google File System (GFS) [57] supports Google’s proprietary
implementation of MapReduce; in the open-source world, HDFS (Hadoop Distributed
File System) is an open-source implementation of GFS that supports Hadoop. Although
MapReduce doesn’t necessarily require the distributed file system, it is difficult to re-
alize many of the advantages of the programming model without a storage substrate
that behaves much like the DFS.
Of course, distributed file systems are not new [74, 32, 7, 147, 133]. The Map-
Reduce distributed file system builds on previous work but is specifically adapted to
large-data processing workloads, and therefore departs from previous architectures in
certain respects (see the discussion by Ghemawat et al. [57] in the original GFS paper).
The main idea is to divide user data into blocks and replicate those blocks across the
local disks of nodes in the cluster. Blocking data, of course, is not a new idea, but DFS
blocks are significantly larger than block sizes in typical single-machine file systems (64
MB by default). The distributed file system adopts a master–slave architecture in which
the master maintains the file namespace (metadata, directory structure, file to block
mapping, location of blocks, and access permissions) and the slaves manage the actual
data blocks. In GFS, the master is called the GFS master, and the slaves are called GFS chunkservers. In Hadoop, the same roles are filled by the namenode and datanodes, respectively. This book adopts the Hadoop terminology, although for most basic file operations GFS and HDFS work much the same way. The architecture of HDFS is shown in Figure 2.5, redrawn from a similar diagram describing GFS [57].
[Footnote 14] However, there is evidence that existing POSIX-based distributed cluster file systems (e.g., GPFS or PVFS) can serve as a replacement for HDFS, when properly tuned or modified for MapReduce workloads [146, 6]. This, however, remains an experimental use case.
Figure 2.5: The architecture of HDFS. The namenode (master) is responsible for maintaining the file namespace and directing clients to datanodes (slaves) that actually hold data blocks containing user data.
In HDFS, an application client wishing to read a file (or a portion thereof) must
first contact the namenode to determine where the actual data is stored. In response
to the client request, the namenode returns the relevant block id and the location
where the block is held (i.e., which datanode). The client then contacts the datanode to
retrieve the data. Blocks are themselves stored on standard single-machine file systems,
so HDFS lies on top of the standard OS stack (e.g., Linux). An important feature of
the design is that data is never moved through the namenode. Instead, all data transfer
occurs directly between clients and datanodes; communications with the namenode only
involves transfer of metadata.
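The division of labor can be sketched with two hypothetical interfaces (these are not the actual HDFS client classes, only an illustration of the protocol): the namenode answers metadata queries, and the client then fetches the bytes directly from a datanode.

    public class ReadPathSketch {
        // Hypothetical interfaces, for illustration only.
        interface Namenode {
            /** Maps (file name, offset) to a block id and a datanode holding that block. */
            BlockLocation locate(String fileName, long offset);
        }
        interface Datanode {
            /** Returns a byte range of a block stored on this datanode's local disk. */
            byte[] read(String blockId, long offset, int length);
        }
        static class BlockLocation {
            final String blockId;
            final Datanode datanode;
            BlockLocation(String blockId, Datanode datanode) { this.blockId = blockId; this.datanode = datanode; }
        }

        static byte[] readFromDfs(Namenode namenode, String fileName, long offset, int length) {
            // Step 1, metadata only: ask the namenode where the data lives.
            BlockLocation loc = namenode.locate(fileName, offset);
            // Step 2, bulk data: fetch bytes directly from the datanode; nothing flows through the namenode.
            return loc.datanode.read(loc.blockId, offset, length);
        }

        public static void main(String[] args) {
            // Trivial in-memory fakes so the sketch runs end to end.
            Datanode datanode = (blockId, offset, length) -> "hello from a data block".getBytes();
            Namenode namenode = (fileName, offset) -> new BlockLocation("blk_3df2", datanode);
            System.out.println(new String(readFromDfs(namenode, "/foo/bar", 0, 23)));
        }
    }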
By default, HDFS stores three separate copies of each data block to ensure
reliability, availability, and performance. In large clusters, the three replicas are spread
across different physical racks, so HDFS is resilient towards two common failure sce-
narios: individual datanode crashes and failures in networking equipment that bring
an entire rack offline. Replicating blocks across physical machines also increases oppor-
tunities to co-locate data and processing in the scheduling of MapReduce jobs, since multiple copies yield more opportunities to exploit locality.
[Footnote 15] To be precise, namenode and datanode may refer to physical machines in a cluster, or they may refer to daemons running on those machines providing the relevant services.
The namenode is in periodic
communication with the datanodes to ensure proper replication of all the blocks: if there
aren’t enough replicas (e.g., due to disk or machine failures or to connectivity losses
due to networking equipment failures), the namenode directs the creation of additional
copies; if there are too many replicas (e.g., a repaired node rejoins the cluster), extra copies are discarded.
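The bookkeeping can be sketched as a comparison of each block's observed replica count against the desired replication factor (three by default); the decision logic and block ids below are illustrative, not HDFS source code.

    import java.util.*;

    public class ReplicationSketch {
        static final int REPLICATION_FACTOR = 3;  // the HDFS default

        /** Decides, per block, whether additional replicas must be created or extra ones discarded. */
        static Map<String, String> checkReplication(Map<String, Integer> replicaCounts) {
            Map<String, String> actions = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> e : replicaCounts.entrySet()) {
                int count = e.getValue();
                if (count < REPLICATION_FACTOR) {
                    actions.put(e.getKey(), "create " + (REPLICATION_FACTOR - count) + " replica(s)");
                } else if (count > REPLICATION_FACTOR) {
                    actions.put(e.getKey(), "discard " + (count - REPLICATION_FACTOR) + " replica(s)");
                }
            }
            return actions;
        }

        public static void main(String[] args) {
            Map<String, Integer> replicaCounts = new LinkedHashMap<>();
            replicaCounts.put("blk_0001", 2);  // a datanode crashed: under-replicated
            replicaCounts.put("blk_0002", 3);  // healthy
            replicaCounts.put("blk_0003", 4);  // a repaired node rejoined: over-replicated
            System.out.println(checkReplication(replicaCounts));
        }
    }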
To create a new file and write data to HDFS, the application client first contacts
the namenode, which updates the file namespace after checking permissions and making
sure the file doesn’t already exist. The namenode allocates a new block on a suitable
datanode, and the application is directed to stream data directly to it. From the initial
datanode, data is further propagated to additional replicas. In the most recent release of
Hadoop as of this writing (release 0.20.2), files are immutable—they cannot be modified
after creation. There are current plans to officially support file appends in the near
future, which is a feature already present in GFS.
In summary, the HDFS namenode has the following responsibilities:
• Namespace management. The namenode is responsible for maintaining the file
namespace, which includes metadata, directory structure, file to block mapping,
location of blocks, and access permissions. These data are held in memory for fast
access and all mutations are persistently logged.
• Coordinating file operations. The namenode directs application clients to datan-
odes for read operations, and allocates blocks on suitable datanodes for write
operations. All data transfers occur directly between clients and datanodes. When
a file is deleted, HDFS does not immediately reclaim the available physical storage;
rather, blocks are lazily garbage collected.
• Maintaining overall health of the file system. The namenode is in periodic contact
with the datanodes via heartbeat messages to ensure the integrity of the system.
If the namenode observes that a data block is under-replicated (fewer copies are
stored on datanodes than the desired replication factor), it will direct the creation
of new replicas. Finally, the namenode is also responsible for rebalancing the file
system.
During the course of normal operations, certain datanodes may end up
holding more blocks than others; rebalancing involves moving blocks from datan-
odes with more blocks to datanodes with fewer blocks. This leads to better load
balancing and more even disk utilization.
[Footnote 16] Note that the namenode coordinates the replication process, but data transfer occurs directly from datanode to datanode.
[Footnote 17] In Hadoop, this is a manually-invoked process.
Since GFS and HDFS were specifically designed to support Google’s proprietary and
the open-source implementation of MapReduce, respectively, they were designed with
a number of assumptions about the operational environment, which in turn influenced
the design of the systems. Understanding these choices is critical to designing effective
MapReduce algorithms:
• The file system stores a relatively modest number of large files. The definition of
“modest” varies by the size of the deployment, but in HDFS multi-gigabyte files
are common (and even encouraged). There are several reasons why lots of small
files are to be avoided. Since the namenode must hold all file metadata in memory,
this presents an upper bound on both the number of files and blocks that can
be supported.
Large multi-block files represent a more efficient use of namenode
memory than many single-block files (each of which consumes less space than a
single block size). In addition, mappers in a MapReduce job use individual files as
a basic unit for splitting input data. At present, there is no default mechanism in
Hadoop that allows a mapper to process multiple files. As a result, mapping over
many small files will yield as many map tasks as there are files. This results in
two potential problems: first, the startup costs of mappers may become significant
compared to the time spent actually processing input key-value pairs; second, this
may result in an excessive amount of across-the-network copy operations during
the “shuffle and sort” phase (recall that a MapReduce job with m mappers and r
reducers involves up to m × r distinct copy operations).
• Workloads are batch oriented, dominated by long streaming reads and large se-
quential writes. As a result, high sustained bandwidth is more important than low
latency. This exactly describes the nature of MapReduce jobs, which are batch
operations on large amounts of data. Due to the common-case workload, both
HDFS and GFS do not implement any form of data caching.
• Applications are aware of the characteristics of the distributed file system. Neither
HDFS nor GFS present a general POSIX-compliant API, but rather support only
a subset of possible file operations. This simplifies the design of the distributed
file system, and in essence pushes part of the data management onto the end
application. One rationale for this decision is that each application knows best
how to handle data specific to that application, for example, in terms of resolving
inconsistent states and optimizing the layout of data structures.
[Footnote 18] According to Dhruba Borthakur in a post to the Hadoop mailing list on 6/8/2008, each block in HDFS occupies about 150 bytes of memory on the namenode.
[Footnote 19] However, since the distributed file system is built on top of a standard operating system such as Linux, there is still OS-level caching.
• The file system is deployed in an environment of cooperative users. There is no
discussion of security in the original GFS paper, but HDFS explicitly assumes a
datacenter environment where only authorized users have access. File permissions
in HDFS are only meant to prevent unintended operations and can be easily
circumvented.
• The system is built from unreliable but inexpensive commodity components. As a
result, failures are the norm rather than the exception. HDFS is designed around
a number of self-monitoring and self-healing mechanisms to robustly cope with
common failure modes.
Finally, some discussion is necessary to understand the single-master design of HDFS
and GFS. It has been demonstrated that in large-scale distributed systems, simultane-
ously providing consistency, availability, and partition tolerance is impossible—this is
Brewer’s so-called CAP Theorem [58]. Since partitioning is unavoidable in large-data
systems, the real tradeoff is between consistency and availability. A single-master de-
sign trades availability for consistency and significantly simplifies implementation. If the
master (HDFS namenode or GFS master) goes down, the entire file system becomes
unavailable, which trivially guarantees that the file system will never be in an incon-
sistent state. An alternative design might involve multiple masters that jointly manage
the file namespace—such an architecture would increase availability (if one goes down,
another can step in) at the cost of consistency, not to mention requiring a more complex
implementation (cf. [4, 105]).
The single-master design of GFS and HDFS is a well-known weakness, since if
the master goes offline, the entire file system and all MapReduce jobs running on top
of it will grind to a halt. This weakness is mitigated in part by the lightweight nature
of file system operations. Recall that no data is ever moved through the namenode and
that all communication between clients and datanodes involve only metadata. Because
of this, the namenode rarely is the bottleneck, and for the most part avoids load-
induced crashes. In practice, this single point of failure is not as severe a limitation as
it may appear—with diligent monitoring of the namenode, mean times between failure
measured in months are not uncommon for production deployments. Furthermore, the
Hadoop community is well-aware of this problem and has developed several reasonable
workarounds—for example, a warm standby namenode that can be quickly switched
over when the primary namenode fails. The open source environment and the fact
that many organizations already depend on Hadoop for production systems virtually
guarantees that more effective solutions will be developed over time.
[Footnote 20] However, there are existing plans to integrate Kerberos into Hadoop/HDFS.
Figure 2.6: Architecture of a complete Hadoop cluster, which consists of three separate components: the HDFS master (called the namenode), the job submission node (called the jobtracker), and many slave nodes (three shown here). Each of the slave nodes runs a tasktracker for executing map and reduce tasks and a datanode daemon for serving HDFS data.
2.6 HADOOP CLUSTER ARCHITECTURE
Putting everything together, the architecture of a complete Hadoop cluster is shown in
Figure 2.6. The HDFS namenode runs the namenode daemon. The job submission node
runs the jobtracker, which is the single point of contact for a client wishing to execute a
MapReduce job. The jobtracker monitors the progress of running MapReduce jobs and
is responsible for coordinating the execution of the mappers and reducers. Typically,
these services run on two separate machines, although in smaller clusters they are often
co-located. The bulk of a Hadoop cluster consists of slave nodes (only three of which
are shown in the figure) that run both a tasktracker, which is responsible for actually
running user code, and a datanode daemon, for serving HDFS data.
A Hadoop MapReduce job is divided up into a number of map tasks and reduce
tasks. Tasktrackers periodically send heartbeat messages to the jobtracker; these messages also double as a vehicle for task allocation. If a tasktracker is available to run tasks (in
Hadoop parlance, has empty task slots), the return acknowledgment of the tasktracker
heartbeat contains task allocation information. The number of reduce tasks is equal
to the number of reducers specified by the programmer. The number of map tasks,
on the other hand, depends on many factors: the number of mappers specified by
the programmer serves as a hint to the execution framework, but the actual number
of tasks depends on both the number of input files and the number of HDFS data
blocks occupied by those files. Each map task is assigned a sequence of input key-value
pairs, called an input split in Hadoop. Input splits are computed automatically and the
execution framework strives to align them to HDFS block boundaries so that each map
task is associated with a single data block. In scheduling map tasks, the jobtracker tries
to take advantage of data locality—if possible, map tasks are scheduled on the slave
node that holds the input split, so that the mapper will be processing local data. The
alignment of input splits with HDFS block boundaries simplifies task scheduling. If it
is not possible to run a map task on local data, it becomes necessary to stream input
key-value pairs across the network. Since large clusters are organized into racks, with
far greater intra-rack bandwidth than inter-rack bandwidth, the execution framework
strives to at least place map tasks on a rack which has a copy of the data block.
Although conceptually in MapReduce one can think of the mapper being applied
to all input key-value pairs and the reducer being applied to all values associated with
the same key, actual job execution is a bit more complex. In Hadoop, mappers are Java
objects with a Map method (among others). A mapper object is instantiated for every
map task by the tasktracker. The life-cycle of this object begins with instantiation,
where a hook is provided in the API to run programmer-specified code. This means
that mappers can read in “side data”, providing an opportunity to load state, static
data sources, dictionaries, etc. After initialization, the Map method is called (by the
execution framework) on all key-value pairs in the input split. Since these method
calls occur in the context of the same Java object, it is possible to preserve state across
multiple input key-value pairs within the same map task—this is an important property
to exploit in the design of MapReduce algorithms, as we will see in the next chapter.
After all key-value pairs in the input split have been processed, the mapper object
provides an opportunity to run programmer-specified termination code. This, too, will
be important in the design of MapReduce algorithms.
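This life cycle can be sketched as follows. The method names initialize, map, and close mirror the hooks just described rather than any particular API, the state preserved in the instance field foreshadows the in-mapper combining pattern of Chapter 3, and the driver in main stands in for the tasktracker feeding one input split to the mapper object.

    import java.util.*;

    public class StatefulMapperSketch {
        // State preserved across all key-value pairs of one input split (i.e., one map task).
        private final Map<String, Integer> counts = new HashMap<>();

        void initialize() {
            // Hook for programmer-specified setup: load side data, dictionaries, static sources, etc.
        }

        void map(String docid, String doc) {
            // Instead of emitting (term, 1) immediately, accumulate counts in the instance field.
            for (String term : doc.split("\\s+")) {
                counts.merge(term, 1, Integer::sum);
            }
        }

        void close() {
            // Hook for programmer-specified termination code: emit the accumulated counts once.
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                System.out.println(e.getKey() + "\t" + e.getValue());  // stands in for Emit(term, count)
            }
        }

        public static void main(String[] args) {
            StatefulMapperSketch mapper = new StatefulMapperSketch();
            mapper.initialize();
            mapper.map("doc1", "a b b");
            mapper.map("doc2", "b c");
            mapper.close();  // emits a 1, b 3, c 1 (in some order)
        }
    }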
The actual execution of reducers is similar to that of the mappers. A reducer object is instantiated for every reduce task. The Hadoop API provides hooks for
programmer-specified initialization and termination code. After initialization, for each
intermediate key in the partition (defined by the partitioner), the execution framework
repeatedly calls the Reduce method with an intermediate key and an iterator over
all values associated with that key. The programming model also guarantees that in-
termediate keys will be presented to the Reduce method in sorted order. Since this
occurs in the context of a single object, it is possible to preserve state across multiple
intermediate keys (and associated values) within a single reduce task. Once again, this
property is critical in the design of MapReduce algorithms and will be discussed in the
next chapter.
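The reducer side can be sketched analogously; the summing logic below is only a placeholder computation, chosen to show where per-key work and cross-key state would live.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer skeleton with the same lifecycle hooks as the mapper.
public class LifecycleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Programmer-specified initialization code.
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Invoked once per intermediate key in this reducer's partition, in sorted key order.
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
    // Because these calls share one object, instance fields can carry state
    // across multiple intermediate keys within the same reduce task.
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Programmer-specified termination code.
  }
}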
2.7 SUMMARY
This chapter provides a basic overview of the MapReduce programming model, starting
with its roots in functional programming and continuing with a description of mappers,
reducers, partitioners, and combiners. Significant attention is also given to the underly-
ing distributed file system, which is a tightly-integrated component of the MapReduce
environment. Given this basic understanding, we now turn our attention to the design
of MapReduce algorithms.
CHAPTER 3
MapReduce Algorithm Design
A large part of the power of MapReduce comes from its simplicity: in addition to
preparing the input data, the programmer needs only to implement the mapper, the
reducer, and optionally, the combiner and the partitioner. All other aspects of execution
are handled transparently by the execution framework—on clusters ranging from a
single node to a few thousand nodes, over datasets ranging from gigabytes to petabytes.
However, this also means that any conceivable algorithm that a programmer wishes to
develop must be expressed in terms of a small number of rigidly-defined components
that must fit together in very specific ways. It may not appear obvious how a multitude
of algorithms can be recast into this programming model. The purpose of this chapter is
to provide, primarily through examples, a guide to MapReduce algorithm design. These
examples illustrate what can be thought of as “design patterns” for MapReduce, which
instantiate arrangements of components and specific techniques designed to handle
frequently-encountered situations across a variety of problem domains. Two of these
design patterns are used in the scalable inverted indexing algorithm we’ll present later
in Chapter 4; concepts presented here will show up again in Chapter 5 (graph processing)
and Chapter 6 (expectation-maximization algorithms).
Synchronization is perhaps the trickiest aspect of designing MapReduce algorithms (or for that matter, parallel and distributed algorithms in general). Other than
embarrassingly-parallel problems, processes running on separate nodes in a cluster must,
at some point in time, come together—for example, to distribute partial results from
nodes that produced them to the nodes that will consume them. Within a single Map-
Reduce job, there is only one opportunity for cluster-wide synchronization—during the
shuffle and sort stage where intermediate key-value pairs are copied from the mappers
to the reducers and grouped by key. Beyond that, mappers and reducers run in isolation
without any mechanisms for direct communication. Furthermore, the programmer has
little control over many aspects of execution, for example:
• Where a mapper or reducer runs (i.e., on which node in the cluster).
• When a mapper or reducer begins or finishes.
• Which input key-value pairs are processed by a specific mapper.
• Which intermediate key-value pairs are processed by a specific reducer.
Nevertheless, the programmer does have a number of techniques for controlling execu-
tion and managing the flow of data in MapReduce. In summary, they are:
1. The ability to construct complex data structures as keys and values to store and
communicate partial results.
2. The ability to execute user-specified initialization code at the beginning of a map
or reduce task, and the ability to execute user-specified termination code at the
end of a map or reduce task.
3. The ability to preserve state in both mappers and reducers across multiple input
or intermediate keys.
4. The ability to control the sort order of intermediate keys, and therefore the order
in which a reducer will encounter particular keys.
5. The ability to control the partitioning of the key space, and therefore the set of
keys that will be encountered by a particular reducer.
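To make this list concrete, the sketch below shows where several of these knobs surface in Hadoop's job configuration API (assuming a Hadoop 2.x-style Job class; older releases construct the Job object differently). The mapper, reducer, and partitioner here are deliberately minimal stand-ins so the sketch is self-contained, not components of any particular algorithm in this chapter.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverSketch {

  // Minimal stand-ins; real jobs plug in their own classes.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          context.write(new Text(token), ONE);  // technique 1: keys/values carry partial results
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      // Technique 5: decide which reducer is responsible for a given key.
      return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "driver sketch");
    job.setJarByClass(DriverSketch.class);

    job.setMapperClass(TokenMapper.class);       // techniques 2 and 3 (hooks, preserved state) would go inside this class
    job.setCombinerClass(SumReducer.class);      // optional local aggregation
    job.setReducerClass(SumReducer.class);
    job.setPartitionerClass(HashKeyPartitioner.class);
    // job.setSortComparatorClass(...);          // technique 4: custom intermediate sort order

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(10);                   // number of reduce tasks (Section 2.6)

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}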
It is important to realize that many algorithms cannot be easily expressed as a single
MapReduce job. One must often decompose complex algorithms into a sequence of jobs,
which requires orchestrating data so that the output of one job becomes the input to the
next. Many algorithms are iterative in nature, requiring repeated execution until some convergence criterion is met—graph algorithms in Chapter 5 and expectation-maximization al-
gorithms in Chapter 6 behave in exactly this way. Often, the convergence check itself
cannot be easily expressed in MapReduce. The standard solution is an external (non-
MapReduce) program that serves as a “driver” to coordinate MapReduce iterations.
This chapter explains how various techniques to control code execution and
data flow can be applied to design algorithms in MapReduce. The focus is both on
scalability—ensuring that there are no inherent bottlenecks as algorithms are applied
to increasingly larger datasets—and efficiency—ensuring that algorithms do not need-
lessly consume resources and thereby reducing the cost of parallelization. The gold
standard, of course, is linear scalability: an algorithm running on twice the amount
of data should take only twice as long. Similarly, an algorithm running on twice the
number of nodes should only take half as long.
The chapter is organized as follows:
• Section 3.1 introduces the important concept of local aggregation in MapReduce
and strategies for designing efficient algorithms that minimize the amount of par-
tial results that need to be copied across the network. The proper use of combiners
is discussed in detail, as well as the “in-mapper combining” design pattern.
• Section 3.2 uses the example of building word co-occurrence matrices on large
text corpora to illustrate two common design patterns, which we dub “pairs” and
“stripes”. These two approaches are useful in a large class of problems that require
keeping track of joint events across a large number of observations.
• Section 3.3 shows how co-occurrence counts can be converted into relative frequen-
cies using a pattern known as “order inversion”. The sequencing of computations
in the reducer can be recast as a sorting problem, where pieces of intermediate
data are sorted into exactly the order that is required to carry out a series of
computations. Often, a reducer needs to compute an aggregate statistic on a set
of elements before individual elements can be processed. Normally, this would re-
quire two passes over the data, but with the “order inversion” design pattern, the
aggregate statistic can be computed in the reducer before the individual elements
are encountered. This may seem counter-intuitive: how can we compute an aggre-
gate statistic on a set of elements before encountering elements of that set? As it
turns out, clever sorting of special key-value pairs enables exactly this.
• Section 3.4 provides a general solution to secondary sorting, which is the problem
of sorting values associated with a key in the reduce phase. We call this technique
“value-to-key conversion”.
• Section 3.5 covers the topic of performing joins on relational datasets and presents
three different approaches: reduce-side, map-side, and memory-backed joins.
3.1 LOCAL AGGREGATION
In the context of data-intensive distributed processing, the single most important as-
pect of synchronization is the exchange of intermediate results, from the processes that
produced them to the processes that will ultimately consume them. In a cluster environ-
ment, with the exception of embarrassingly-parallel problems, this necessarily involves
transferring data over the network. Furthermore, in Hadoop, intermediate results are
written to local disk before being sent over the network. Since network and disk laten-
cies are relatively expensive compared to other operations, reductions in the amount of
intermediate data translate into increases in algorithmic efficiency. In MapReduce, local
aggregation of intermediate results is one of the keys to efficient algorithms. Through
use of the combiner and by taking advantage of the ability to preserve state across
multiple inputs, it is often possible to substantially reduce both the number and size of
key-value pairs that need to be shuffled from the mappers to the reducers.
3.1.1 COMBINERS AND IN-MAPPER COMBINING
We illustrate various techniques for local aggregation using the simple word count ex-
ample presented in Section 2.2. For convenience, Figure 3.1 repeats the pseudo-code of
the basic algorithm, which is quite simple: the mapper emits an intermediate key-value
pair for each term observed, with the term itself as the key and a value of one; reducers
sum up the partial counts to arrive at the final count.
1: class Mapper
2: method Map(docid a, doc d)
3: for all term t ∈ doc d do
4: Emit(term t, count 1)
1: class Reducer
2: method Reduce(term t, counts [c_1, c_2, . . .])
3: sum ← 0
4: for all count c ∈ counts [c_1, c_2, . . .] do
5: sum ← sum + c
6: Emit(term t, count sum)
Figure 3.1: Pseudo-code for the basic word count algorithm in MapReduce (repeated from
Figure 2.3).
The first technique for local aggregation is the combiner, already discussed in
Section 2.4. Combiners provide a general mechanism within the MapReduce framework
to reduce the amount of intermediate data generated by the mappers—recall that they
can be understood as “mini-reducers” that process the output of mappers. In this
example, the combiners aggregate term counts across the documents processed by each
map task. This results in a reduction in the number of intermediate key-value pairs that
need to be shuffled across the network—from the order of total number of terms in the
collection to the order of the number of unique terms in the collection.¹
An improvement on the basic algorithm is shown in Figure 3.2 (the mapper is
modified but the reducer remains the same as in Figure 3.1 and therefore is not re-
peated). An associative array (i.e., Map in Java) is introduced inside the mapper to
tally up term counts within a single document: instead of emitting a key-value pair for
each term in the document, this version emits a key-value pair for each unique term in
the document. Given that some words appear frequently within a document (for exam-
ple, a document about dogs is likely to have many occurrences of the word “dog”), this
can yield substantial savings in the number of intermediate key-value pairs emitted,
especially for long documents.
¹ More precisely, if the combiners take advantage of all opportunities for local aggregation, the algorithm would generate at most m × V intermediate key-value pairs, where m is the number of mappers and V is the vocabulary size (number of unique terms in the collection), since every term could have been observed in every mapper. However, there are two additional factors to consider. Due to the Zipfian nature of term distributions, most terms will not be observed by most mappers (for example, terms that occur only once will by definition only be observed by one mapper). On the other hand, combiners in Hadoop are treated as optional optimizations, so there is no guarantee that the execution framework will take advantage of all opportunities for partial aggregation.
1: class Mapper
2: method Map(docid a, doc d)
3: H ← new AssociativeArray
4: for all term t ∈ doc d do
5: H{t} ← H{t} + 1 Tally counts for entire document
6: for all term t ∈ H do
7: Emit(term t, count H{t})
Figure 3.2: Pseudo-code for the improved MapReduce word count algorithm that uses an
associative array to aggregate term counts on a per-document basis. Reducer is the same as in
Figure 3.1.
This basic idea can be taken one step further, as illustrated in the variant of the
word count algorithm in Figure 3.3 (once again, only the mapper is modified). The
workings of this algorithm critically depend on the details of how map and reduce
tasks in Hadoop are executed, discussed in Section 2.6. Recall, a (Java) mapper object
is created for each map task, which is responsible for processing a block of input key-
value pairs. Prior to processing any input key-value pairs, the mapper’s Initialize
method is called, which is an API hook for user-specified code. In this case, we initialize
an associative array for holding term counts. Since it is possible to preserve state across
multiple calls of the Map method (for each input key-value pair), we can continue
to accumulate partial term counts in the associative array across multiple documents,
and emit key-value pairs only when the mapper has processed all documents. That is,
emission of intermediate data is deferred until the Close method in the pseudo-code.
Recall that this API hook provides an opportunity to execute user-specified code after
the Map method has been applied to all input key-value pairs of the input data split
to which the map task was assigned.
With this technique, we are in essence incorporating combiner functionality di-
rectly inside the mapper. There is no need to run a separate combiner, since all op-
portunities for local aggregation are already exploited.² This is a sufficiently common
design pattern in MapReduce that it’s worth giving it a name, “in-mapper combining”,
so that we can refer to the pattern more conveniently throughout the book. We’ll see
later on how this pattern can be applied to a variety of problems. There are two main
advantages to using this design pattern:
First, it provides control over when local aggregation occurs and how it exactly
takes place. In contrast, the semantics of the combiner is underspecified in MapReduce.
² Leaving aside the minor complication that in Hadoop, combiners can be run in the reduce phase also (when merging intermediate key-value pairs from different map tasks). However, in practice it makes almost no difference either way.
1: class Mapper
2: method Initialize
3: H ← new AssociativeArray
4: method Map(docid a, doc d)
5: for all term t ∈ doc d do
6: H{t} ← H{t} + 1 Tally counts across documents
7: method Close
8: for all term t ∈ H do
9: Emit(term t, count H{t})
Figure 3.3: Pseudo-code for the improved MapReduce word count algorithm that demon-
strates the “in-mapper combining” design pattern. Reducer is the same as in Figure 3.1.
For example, Hadoop makes no guarantees on how many times the combiner is applied,
or that it is even applied at all. The combiner is provided as a semantics-preserving
optimization to the execution framework, which has the option of using it, perhaps
multiple times, or not at all (or even in the reduce phase). In some cases (although not
in this particular example), such indeterminism is unacceptable, which is exactly why
programmers often choose to perform their own local aggregation in the mappers.
Second, in-mapper combining will typically be more efficient than using actual
combiners. One reason for this is the additional overhead associated with actually ma-
terializing the key-value pairs. Combiners reduce the amount of intermediate data that
is shuffled across the network, but don’t actually reduce the number of key-value pairs
that are emitted by the mappers in the first place. With the algorithm in Figure 3.2,
intermediate key-value pairs are still generated on a per-document basis, only to be
“compacted” by the combiners. This process involves unnecessary object creation and
destruction (garbage collection takes time), and furthermore, object serialization and
deserialization (when intermediate key-value pairs fill the in-memory buffer holding map
outputs and need to be temporarily spilled to disk). In contrast, with in-mapper com-
bining, the mappers will generate only those key-value pairs that need to be shuffled
across the network to the reducers.
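A possible Java rendering of the in-mapper combining word count mapper (Figure 3.3) is sketched below, assuming Hadoop's org.apache.hadoop.mapreduce API and line-oriented text input in which each value plays the role of a document; the class name and the whitespace tokenization are illustrative simplifications.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word count mapper using in-mapper combining: counts accumulate in an in-memory
// map across all documents in the input split and are emitted in cleanup().
public class InMapperCombiningWordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String term : value.toString().split("\\s+")) {
      if (term.isEmpty()) {
        continue;
      }
      Integer c = counts.get(term);
      counts.put(term, c == null ? 1 : c + 1);  // tally counts across documents
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Deferred emission: one key-value pair per unique term observed by this map task.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}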
There are, however, drawbacks to the in-mapper combining pattern. First, it
breaks the functional programming underpinnings of MapReduce, since state is be-
ing preserved across multiple input key-value pairs. Ultimately, this isn’t a big deal,
since pragmatic concerns for efficiency often trump theoretical “purity”, but there are
practical consequences as well. Preserving state across multiple input instances means
that algorithmic behavior may depend on the order in which input key-value pairs are
encountered. This creates the potential for ordering-dependent bugs, which are difficult
to debug on large datasets in the general case (although the correctness of in-mapper
combining for word count is easy to demonstrate). Second, there is a fundamental scala-
bility bottleneck associated with the in-mapper combining pattern. It critically depends
on having sufficient memory to store intermediate results until the mapper has com-
pletely processed all key-value pairs in an input split. In the word count example, the
memory footprint is bound by the vocabulary size, since it is theoretically possible that
a mapper encounters every term in the collection. Heaps' Law, a well-known result in information retrieval, accurately models the growth of vocabulary size as a function of the collection size—the somewhat surprising fact is that the vocabulary size never stops growing.³ Therefore, the algorithm in Figure 3.3 will scale only up to a point, beyond which the associative array holding the partial term counts will no longer fit in memory.⁴
One common solution to limiting memory usage when using the in-mapper com-
bining technique is to “block” input key-value pairs and “flush” in-memory data struc-
tures periodically. The idea is simple: instead of emitting intermediate data only after
every key-value pair has been processed, emit partial results after processing every n
key-value pairs. This is straightforwardly implemented with a counter variable that
keeps track of the number of input key-value pairs that have been processed. As an
alternative, the mapper could keep track of its own memory footprint and flush inter-
mediate key-value pairs once memory usage has crossed a certain threshold. In both
approaches, either the block size or the memory usage threshold needs to be determined
empirically: with too large a value, the mapper may run out of memory, but with too
small a value, opportunities for local aggregation may be lost. Furthermore, in Hadoop
physical memory is split between multiple tasks that may be running on a node con-
currently; these tasks are all competing for finite resources, but since the tasks are not
aware of each other, it is difficult to coordinate resource consumption effectively. In
practice, however, one often encounters diminishing returns in performance gains with
increasing buffer sizes, such that it is not worth the effort to search for an optimal buffer
size (personal communication, Jeff Dean).
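The following variant sketches the block-and-flush idea under the same assumptions as the previous sketch; the flush interval is a made-up constant that would need to be tuned empirically, exactly as discussed above.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining with periodic flushing: partial counts are emitted every
// FLUSH_EVERY input key-value pairs to bound the memory footprint.
public class BoundedInMapperCombiningMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int FLUSH_EVERY = 10000;  // block size; tune empirically
  private final Map<String, Integer> counts = new HashMap<String, Integer>();
  private long processed = 0;

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String term : value.toString().split("\\s+")) {
      if (term.isEmpty()) {
        continue;
      }
      Integer c = counts.get(term);
      counts.put(term, c == null ? 1 : c + 1);
    }
    if (++processed % FLUSH_EVERY == 0) {
      flush(context);  // emit partial results and release memory
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    flush(context);  // emit whatever remains at the end of the split
  }

  private void flush(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
    counts.clear();
  }
}

Because each flush emits partial counts, downstream reducers (and any combiners) simply sum them, just as in the basic word count algorithm.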
In MapReduce algorithms, the extent to which efficiency can be increased through
local aggregation depends on the size of the intermediate key space, the distribution of
keys themselves, and the number of key-value pairs that are emitted by each individual
map task. Opportunities for aggregation, after all, come from having multiple values
associated with the same key (whether one uses combiners or employs the in-mapper
combining pattern). In the word count example, local aggregation is effective because
many words are encountered multiple times within a map task. Local aggregation is also an effective technique for dealing with reduce stragglers (see Section 2.3) that result from a highly-skewed (e.g., Zipfian) distribution of values associated with intermediate keys. In our word count example, we do not filter frequently-occurring words: therefore, without local aggregation, the reducer that's responsible for computing the count of 'the' will have a lot more work to do than the typical reducer, and therefore will likely be a straggler. With local aggregation (either combiners or in-mapper combining), we substantially reduce the number of values associated with frequently-occurring terms, which alleviates the reduce straggler problem.
³ In more detail, Heaps' Law relates the vocabulary size V to the collection size as follows: V = kT^b, where T is the number of tokens in the collection. Typical values of the parameters k and b are: 30 ≤ k ≤ 100 and b ∼ 0.5 ([101], p. 81).
⁴ A few more details: note that what matters is that the partial term counts encountered within a particular input split fit into memory. However, as collection sizes increase, one will often want to increase the input split size to limit the growth of the number of map tasks (in order to reduce the number of distinct copy operations necessary to shuffle intermediate data over the network).
3.1.2 ALGORITHMIC CORRECTNESS WITH LOCAL AGGREGATION
Although use of combiners can yield dramatic reductions in algorithm running time,
care must be taken in applying them. Since combiners in Hadoop are viewed as op-
tional optimizations, the correctness of the algorithm cannot depend on computations
performed by the combiner or depend on them even being run at all. In any MapReduce
program, the reducer input key-value type must match the mapper output key-value
type: this implies that the combiner input and output key-value types must match the
mapper output key-value type (which is the same as the reducer input key-value type).
In cases where the reduce computation is both commutative and associative, the re-
ducer can also be used (unmodified) as the combiner (as is the case with the word count
example). In the general case, however, combiners and reducers are not interchangeable.
Consider a simple example: we have a large dataset where input keys are strings
and input values are integers, and we wish to compute the mean of all integers associated
with the same key (rounded to the nearest integer). A real-world example might be a
large user log from a popular website, where keys represent user ids and values represent
some measure of activity such as elapsed time for a particular session—the task would
correspond to computing the mean session length on a per-user basis, which would
be useful for understanding user demographics. Figure 3.4 shows the pseudo-code of
a simple algorithm for accomplishing this task that does not involve combiners. We
use an identity mapper, which simply passes all input key-value pairs to the reducers
(appropriately grouped and sorted). The reducer keeps track of the running sum and
the number of integers encountered. This information is used to compute the mean once
all values are processed. The mean is then emitted as the output value in the reducer
(with the input string as the key).
This algorithm will indeed work, but suffers from the same drawbacks as the
basic word count algorithm in Figure 3.1: it requires shuffling all key-value pairs from
mappers to reducers across the network, which is highly inefficient. Unlike in the word
count example, the reducer cannot be used as a combiner in this case. Consider what
would happen if we did: the combiner would compute the mean of an arbitrary subset
1: class Mapper
2: method Map(string t, integer r)
3: Emit(string t, integer r)
1: class Reducer
2: method Reduce(string t, integers [r_1, r_2, . . .])
3: sum ← 0
4: cnt ← 0
5: for all integer r ∈ integers [r_1, r_2, . . .] do
6: sum ← sum + r
7: cnt ← cnt + 1
8: r_avg ← sum/cnt
9: Emit(string t, integer r_avg)
Figure 3.4: Pseudo-code for the basic MapReduce algorithm that computes the mean of values
associated with the same key.
of values associated with the same key, and the reducer would compute the mean of
those values. As a concrete example, we know that:
Mean(1, 2, 3, 4, 5) ≠ Mean(Mean(1, 2), Mean(3, 4, 5))
In general, the mean of means of arbitrary subsets of a set of numbers is not the same
as the mean of the set of numbers. Therefore, this approach would not produce the
correct result.⁵
So how might we properly take advantage of combiners? An attempt is shown in
Figure 3.5. The mapper remains the same, but we have added a combiner that partially
aggregates results by computing the numeric components necessary to arrive at the
mean. The combiner receives each string and the associated list of integer values, from
which it computes the sum of those values and the number of integers encountered (i.e.,
the count). The sum and count are packaged into a pair, and emitted as the output
of the combiner, with the same string as the key. In the reducer, pairs of partial sums
and counts can be aggregated to arrive at the mean. Up until now, all keys and values
in our algorithms have been primitives (string, integers, etc.). However, there are no
prohibitions in MapReduce for more complex types,⁶ and, in fact, this represents a key
technique in MapReduce algorithm design that we introduced at the beginning of this
⁵ There is, however, one special case in which using reducers as combiners would produce the correct result: if each combiner computed the mean of equal-size subsets of the values. However, since such fine-grained control over the combiners is impossible in MapReduce, such a scenario is highly unlikely.
⁶ In Hadoop, either custom types or types defined using a library such as Protocol Buffers, Thrift, or Avro.
1: class Mapper
2: method Map(string t, integer r)
3: Emit(string t, integer r)
1: class Combiner
2: method Combine(string t, integers [r_1, r_2, . . .])
3: sum ← 0
4: cnt ← 0
5: for all integer r ∈ integers [r_1, r_2, . . .] do
6: sum ← sum + r
7: cnt ← cnt + 1
8: Emit(string t, pair (sum, cnt)) Separate sum and count
1: class Reducer
2: method Reduce(string t, pairs [(s_1, c_1), (s_2, c_2) . . .])
3: sum ← 0
4: cnt ← 0
5: for all pair (s, c) ∈ pairs [(s_1, c_1), (s_2, c_2) . . .] do
6: sum ← sum + s
7: cnt ← cnt + c
8: r_avg ← sum/cnt
9: Emit(string t, integer r_avg)
Figure 3.5: Pseudo-code for an incorrect first attempt at introducing combiners to compute
the mean of values associated with each key. The mismatch between combiner input and output
key-value types violates the MapReduce programming model.
chapter. We will frequently encounter complex keys and values throughout the rest of
this book.
Unfortunately, this algorithm will not work. Recall that combiners must have the
same input and output key-value type, which also must be the same as the mapper
output type and the reducer input type. This is clearly not the case. To understand
why this restriction is necessary in the programming model, remember that combiners
are optimizations that cannot change the correctness of the algorithm. So let us remove
the combiner and see what happens: the output value type of the mapper is integer,
so the reducer expects to receive a list of integers as values. But the reducer actually
expects a list of pairs! The correctness of the algorithm is contingent on the combiner
running on the output of the mappers, and more specifically, that the combiner is run
exactly once. Recall from our previous discussion that Hadoop makes no guarantees on
1: class Mapper
2: method Map(string t, integer r)
3: Emit(string t, pair (r, 1))
1: class Combiner
2: method Combine(string t, pairs [(s_1, c_1), (s_2, c_2) . . .])
3: sum ← 0
4: cnt ← 0
5: for all pair (s, c) ∈ pairs [(s_1, c_1), (s_2, c_2) . . .] do
6: sum ← sum + s
7: cnt ← cnt + c
8: Emit(string t, pair (sum, cnt))
1: class Reducer
2: method Reduce(string t, pairs [(s_1, c_1), (s_2, c_2) . . .])
3: sum ← 0
4: cnt ← 0
5: for all pair (s, c) ∈ pairs [(s_1, c_1), (s_2, c_2) . . .] do
6: sum ← sum + s
7: cnt ← cnt + c
8: r_avg ← sum/cnt
9: Emit(string t, integer r_avg)
Figure 3.6: Pseudo-code for a MapReduce algorithm that computes the mean of values asso-
ciated with each key. This algorithm correctly takes advantage of combiners.
how many times combiners are called; it could be zero, one, or multiple times. This
violates the MapReduce programming model.
Another stab at the algorithm is shown in Figure 3.6, and this time, the algorithm
is correct. In the mapper we emit as the value a pair consisting of the integer and
one—this corresponds to a partial count over one instance. The combiner separately
aggregates the partial sums and the partial counts (as before), and emits pairs with
updated sums and counts. The reducer is similar to the combiner, except that the
mean is computed at the end. In essence, this algorithm transforms a non-associative
operation (mean of numbers) into an associative operation (element-wise sum of a pair
of numbers, with an additional division at the very end).
Let us verify the correctness of this algorithm by repeating the previous exercise:
What would happen if no combiners were run? With no combiners, the mappers would
send pairs (as values) directly to the reducers. There would be as many intermediate
pairs as there were input key-value pairs, and each of those would consist of an integer
1: class Mapper
2: method Initialize
3: S ← new AssociativeArray
4: C ← new AssociativeArray
5: method Map(string t, integer r)
6: S{t} ← S{t} + r
7: C{t} ← C{t} + 1
8: method Close
9: for all term t ∈ S do
10: Emit(term t, pair (S{t}, C{t}))
Figure 3.7: Pseudo-code for a MapReduce algorithm that computes the mean of values asso-
ciated with each key, illustrating the in-mapper combining design pattern. Only the mapper is
shown here; the reducer is the same as in Figure 3.6.
and one. The reducer would still arrive at the correct sum and count, and hence the
mean would be correct. Now add in the combiners: the algorithm would remain correct,
no matter how many times they run, since the combiners merely aggregate partial sums
and counts to pass along to the reducers. Note that although the output key-value type
of the combiner must be the same as the input key-value type of the reducer, the reducer
can emit final key-value pairs of a different type.
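One way to render the algorithm of Figure 3.6 in Java is sketched below. For brevity, the (sum, count) pair is encoded as a tab-separated Text value rather than a custom Writable (or a Protocol Buffers/Thrift/Avro type), and the input is assumed to be lines of the form key<TAB>integer; both choices are illustrative, not prescriptive.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanWithCombiner {

  // Assumed input: lines of the form "key<TAB>integer".
  public static class MeanMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length < 2) {
        return;  // skip malformed records in this sketch
      }
      // Emit a partial (sum, count) pair over a single instance: (r, 1).
      context.write(new Text(fields[0]), new Text(fields[1] + "\t" + 1));
    }
  }

  // Shared pair-summing logic for the combiner and the reducer.
  private static long[] sumPairs(Iterable<Text> values) {
    long sum = 0;
    long cnt = 0;
    for (Text value : values) {
      String[] pair = value.toString().split("\t");
      sum += Long.parseLong(pair[0]);
      cnt += Long.parseLong(pair[1]);
    }
    return new long[] { sum, cnt };
  }

  public static class MeanCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      long[] sc = sumPairs(values);
      // Input and output types match the mapper output: safe to run zero or more times.
      context.write(key, new Text(sc[0] + "\t" + sc[1]));
    }
  }

  public static class MeanReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      long[] sc = sumPairs(values);
      // Only the final reducer performs the division.
      context.write(key, new IntWritable((int) Math.round((double) sc[0] / sc[1])));
    }
  }
}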
Finally, in Figure 3.7, we present an even more efficient algorithm that exploits the
in-mapper combining pattern. Inside the mapper, the partial sums and counts associated
with each string are held in memory across input key-value pairs. Intermediate key-value
pairs are emitted only after the entire input split has been processed; similar to before,
the value is a pair consisting of the sum and count. The reducer is exactly the same as
in Figure 3.6. Moving partial aggregation from the combiner directly into the mapper
is subject to all the tradeoffs and caveats discussed earlier in this section, but in this
case the memory footprint of the data structures for holding intermediate data is likely
to be modest, making this variant algorithm an attractive option.
3.2 PAIRS AND STRIPES
One common approach for synchronization in MapReduce is to construct complex keys
and values in such a way that data necessary for a computation are naturally brought
together by the execution framework. We first touched on this technique in the previous
section, in the context of “packaging” partial sums and counts in a complex value
(i.e., pair) that is passed from mapper to combiner to reducer. Building on previously
published work [54, 94], this section introduces two common design patterns we have
dubbed “pairs” and “stripes” that exemplify this strategy.
As a running example, we focus on the problem of building word co-occurrence
matrices from large corpora, a common task in corpus linguistics and statistical natural
language processing. Formally, the co-occurrence matrix of a corpus is a square n × n matrix where n is the number of unique words in the corpus (i.e., the vocabulary size). A cell m_ij contains the number of times word w_i co-occurs with word w_j within a specific context—a natural unit such as a sentence, paragraph, or a document, or a certain window of m words (where m is an application-dependent parameter). Note that the upper and lower triangles of the matrix are identical since co-occurrence is a symmetric relation, though in the general case relations between words need not be symmetric. For example, a co-occurrence matrix M where m_ij is the count of how many times word i was immediately succeeded by word j would usually not be symmetric.
This task is quite common in text processing and provides the starting point to
many other algorithms, e.g., for computing statistics such as pointwise mutual infor-
mation [38], for unsupervised sense clustering [136], and more generally, a large body
of work in lexical semantics based on distributional profiles of words, dating back to
Firth [55] and Harris [69] in the 1950s and 1960s. The task also has applications in in-
formation retrieval (e.g., automatic thesaurus construction [137] and stemming [157]),
and other related fields such as text mining. More importantly, this problem represents
a specific instance of the task of estimating distributions of discrete joint events from a
large number of observations, a very common task in statistical natural language pro-
cessing for which there are nice MapReduce solutions. Indeed, concepts presented here
are also used in Chapter 6 when we discuss expectation-maximization algorithms.
Beyond text processing, problems in many application domains share similar char-
acteristics. For example, a large retailer might analyze point-of-sale transaction records
to identify correlated product purchases (e.g., customers who buy this tend to also buy
that), which would assist in inventory management and product placement on store
shelves. Similarly, an intelligence analyst might wish to identify associations between
re-occurring financial transactions that are otherwise unrelated, which might provide a
clue in thwarting terrorist activity. The algorithms discussed in this section could be
adapted to tackle these related problems.
It is obvious that the space requirement for the word co-occurrence problem is O(n²), where n is the size of the vocabulary, which for real-world English corpora can be hundreds of thousands of words, or even billions of words in web-scale collections.⁷
⁷ The size of the vocabulary depends on the definition of a “word” and techniques (if any) for corpus preprocessing. One common strategy is to replace all rare words (below a certain frequency) with a “special” token such as <UNK> (which stands for “unknown”) to model out-of-vocabulary words. Another technique involves replacing numeric digits with #, such that 1.32 and 1.19 both map to the same token (#.##).
The computation of the word co-occurrence matrix is quite simple if the entire matrix
fits into memory—however, in the case where the matrix is too big to fit in memory,
a naïve implementation on a single machine can be very slow as memory is paged to
disk. Although compression techniques can increase the size of corpora for which word
co-occurrence matrices can be constructed on a single machine, it is clear that there are
inherent scalability limitations. We describe two MapReduce algorithms for this task
that can scale to large corpora.
Pseudo-code for the first algorithm, dubbed the “pairs” approach, is shown in
Figure 3.8. As usual, document ids and the corresponding contents make up the input
key-value pairs. The mapper processes each input document and emits intermediate
key-value pairs with each co-occurring word pair as the key and the integer one (i.e.,
the count) as the value. This is straightforwardly accomplished by two nested loops:
the outer loop iterates over all words (the left element in the pair), and the inner
loop iterates over all neighbors of the first word (the right element in the pair). The
neighbors of a word can either be defined in terms of a sliding window or some other
contextual unit such as a sentence. The MapReduce execution framework guarantees
that all values associated with the same key are brought together in the reducer. Thus,
in this case the reducer simply sums up all the values associated with the same co-
occurring word pair to arrive at the absolute count of the joint event in the corpus,
which is then emitted as the final key-value pair. Each pair corresponds to a cell in the
word co-occurrence matrix. This algorithm illustrates the use of complex keys in order
to coordinate distributed computations.
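A Java sketch of the pairs approach follows, under the simplifying assumptions that each input value is a whitespace-tokenized document and that the pair key is encoded as a single tab-separated Text (a custom WritableComparable would be the more typical choice); the window size is an arbitrary illustrative value.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairsCooccurrence {

  public static class PairsMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private static final int WINDOW = 2;  // co-occurrence window; application-dependent

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] terms = value.toString().split("\\s+");
      for (int i = 0; i < terms.length; i++) {
        if (terms[i].isEmpty()) {
          continue;
        }
        // Neighbors are defined here by a sliding window around position i.
        int start = Math.max(0, i - WINDOW);
        int end = Math.min(terms.length - 1, i + WINDOW);
        for (int j = start; j <= end; j++) {
          if (j == i || terms[j].isEmpty()) {
            continue;
          }
          // Emit count 1 for the co-occurring pair (w, u), encoded as "w<TAB>u".
          context.write(new Text(terms[i] + "\t" + terms[j]), ONE);
        }
      }
    }
  }

  public static class PairsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();  // sum co-occurrence counts for this pair
      }
      context.write(pair, new IntWritable(sum));
    }
  }
}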
An alternative approach, dubbed the “stripes” approach, is presented in Fig-
ure 3.9. Like the pairs approach, co-occurring word pairs are generated by two nested
loops. However, the major difference is that instead of emitting intermediate key-value
pairs for each co-occurring word pair, co-occurrence information is first stored in an
associative array, denoted H. The mapper emits key-value pairs with words as keys
and corresponding associative arrays as values, where each associative array encodes
the co-occurrence counts of the neighbors of a particular word (i.e., its context). The
MapReduce execution framework guarantees that all associative arrays with the same
key will be brought together in the reduce phase of processing. The reducer performs an
element-wise sum of all associative arrays with the same key, accumulating counts that
correspond to the same cell in the co-occurrence matrix. The final associative array is
emitted with the same word as the key. In contrast to the pairs approach, each final
key-value pair encodes a row in the co-occurrence matrix.
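The stripes approach can be sketched in Java using Hadoop's MapWritable as the associative array; as before, the tokenization, window size, and class names are illustrative assumptions rather than a canonical implementation.

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StripesCooccurrence {

  public static class StripesMapper extends Mapper<Object, Text, Text, MapWritable> {
    private static final int WINDOW = 2;  // co-occurrence window; application-dependent

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] terms = value.toString().split("\\s+");
      for (int i = 0; i < terms.length; i++) {
        if (terms[i].isEmpty()) {
          continue;
        }
        MapWritable stripe = new MapWritable();  // associative array H for this occurrence
        int start = Math.max(0, i - WINDOW);
        int end = Math.min(terms.length - 1, i + WINDOW);
        for (int j = start; j <= end; j++) {
          if (j == i || terms[j].isEmpty()) {
            continue;
          }
          Text neighbor = new Text(terms[j]);
          IntWritable count = (IntWritable) stripe.get(neighbor);
          stripe.put(neighbor, new IntWritable(count == null ? 1 : count.get() + 1));
        }
        context.write(new Text(terms[i]), stripe);
      }
    }
  }

  public static class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text word, Iterable<MapWritable> stripes, Context context)
        throws IOException, InterruptedException {
      MapWritable merged = new MapWritable();
      for (MapWritable stripe : stripes) {
        // Element-wise sum of associative arrays with the same key.
        for (Map.Entry<Writable, Writable> entry : stripe.entrySet()) {
          Text neighbor = new Text(entry.getKey().toString());  // defensive copy
          int add = ((IntWritable) entry.getValue()).get();
          IntWritable current = (IntWritable) merged.get(neighbor);
          merged.put(neighbor, new IntWritable(current == null ? add : current.get() + add));
        }
      }
      context.write(word, merged);
    }
  }
}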
It is immediately obvious that the pairs algorithm generates an immense number
of key-value pairs compared to the stripes approach. The stripes representation is much
more compact, since with pairs the left element is repeated for every co-occurring word
pair. The stripes approach also generates fewer and shorter intermediate keys, and
therefore the execution framework has less sorting to perform. However, values in the
1: class Mapper
2: method Map(docid a, doc d)
3: for all term w ∈ doc d do
4: for all term u ∈ Neighbors(w) do
5: Emit(pair (w, u), count 1) Emit count for each co-occurrence
1: class Reducer
2: method Reduce(pair p, counts [c_1, c_2, . . .])
3: s ← 0
4: for all count c ∈ counts [c_1, c_2, . . .] do
5: s ← s + c Sum co-occurrence counts
6: Emit(pair p, count s)
Figure 3.8: Pseudo-code for the “pairs” approach for computing word co-occurrence matrices
from large corpora.
1: class Mapper
2: method Map(docid a, doc d)
3: for all term w ∈ doc d do
4: H ← new AssociativeArray
5: for all term u ∈ Neighbors(w) do
6: H{u} ← H{u} + 1 Tally words co-occurring with w
7: Emit(Term w, Stripe H)
1: class Reducer
2: method Reduce(term w, stripes [H_1, H_2, H_3, . . .])
3: H_f ← new AssociativeArray
4: for all stripe H ∈ stripes [H_1, H_2, H_3, . . .] do
5: Sum(H_f, H) Element-wise sum
6: Emit(term w, stripe H_f)
Figure 3.9: Pseudo-code for the “stripes” approach for computing word co-occurrence matrices
from large corpora.
stripes approach are more complex, and come with more serialization and deserialization
overhead than with the pairs approach.
Both algorithms can benefit from the use of combiners, since the respective oper-
ations in their reducers (addition and element-wise sum of associative arrays) are both
commutative and associative. However, combiners with the stripes approach have more
opportunities to perform local aggregation because the key space is the vocabulary—
associative arrays can be merged whenever a word is encountered multiple times by
a mapper. In contrast, the key space in the pairs approach is the cross of the vocab-
ulary with itself, which is far larger—counts can be aggregated only when the same
co-occurring word pair is observed multiple times by an individual mapper (which is
less likely than observing multiple occurrences of a word, as in the stripes case).
For both algorithms, the in-mapper combining optimization discussed in the pre-
vious section can also be applied; the modification is sufficiently straightforward that
we leave the implementation as an exercise for the reader. However, the above caveats
remain: there will be far fewer opportunities for partial aggregation in the pairs ap-
proach due to the sparsity of the intermediate key space. The sparsity of the key space
also limits the effectiveness of in-memory combining, since the mapper may run out of
memory to store partial counts before all documents are processed, necessitating some
mechanism to periodically emit key-value pairs (which further limits opportunities to
perform partial aggregation). Similarly, for the stripes approach, memory management
will also be more complex than in the simple word count example. For common terms,
the associative array may grow to be quite large, necessitating some mechanism to
periodically flush in-memory structures.
It is important to consider potential scalability bottlenecks of either algorithm.
The stripes approach makes the assumption that, at any point in time, each associative
array is small enough to fit into memory—otherwise, memory paging will significantly
impact performance. The size of the associative array is bounded by the vocabulary size,
which is itself unbounded with respect to corpus size (recall the previous discussion of
Heaps' Law). Therefore, as the sizes of corpora increase, this will become an increasingly
pressing issue—perhaps not for gigabyte-sized corpora, but certainly for terabyte-sized
and petabyte-sized corpora that will be commonplace tomorrow. The pairs approach,
on the other hand, does not suffer from this limitation, since it does not need to hold
intermediate data in memory.
Given this discussion, which approach is faster? Here, we present previously-
published results [94] that empirically answered this question. We have implemented
both algorithms in Hadoop and applied them to a corpus of 2.27 million documents
from the Associated Press Worldstream (APW) totaling 5.7 GB.⁸ Prior to working with Hadoop, the corpus was first preprocessed as follows: All XML markup was re-
moved, followed by tokenization and stopword removal using standard tools from the
Lucene search engine. All tokens were then replaced with unique integers for a more
efficient encoding. Figure 3.10 compares the running time of the pairs and stripes ap-
proach on different fractions of the corpus, with a co-occurrence window size of two.
These experiments were performed on a Hadoop cluster with 19 slave nodes, each with
two single-core processors and two disks.
⁸ This was a subset of the English Gigaword corpus (version 3) distributed by the Linguistic Data Consortium (LDC catalog number LDC2007T07).
Results demonstrate that the stripes approach is much faster than the pairs ap-
proach: 666 seconds (∼11 minutes) compared to 3758 seconds (∼62 minutes) for the
entire corpus (improvement by a factor of 5.7). The mappers in the pairs approach gen-
erated 2.6 billion intermediate key-value pairs totaling 31.2 GB. After the combiners,
this was reduced to 1.1 billion key-value pairs, which quantifies the amount of interme-
diate data transferred across the network. In the end, the reducers emitted a total of 142
million final key-value pairs (the number of non-zero cells in the co-occurrence matrix).
On the other hand, the mappers in the stripes approach generated 653 million interme-
diate key-value pairs totaling 48.1 GB. After the combiners, only 28.8 million key-value
pairs remained. The reducers emitted a total of 1.69 million final key-value pairs (the
number of rows in the co-occurrence matrix). As expected, the stripes approach pro-
vided more opportunities for combiners to aggregate intermediate results, thus greatly
reducing network traffic in the shuffle and sort phase. Figure 3.10 also shows that both
algorithms exhibit highly desirable scaling characteristics—linear in the amount of in-
put data. This is confirmed by a linear regression applied to the running time data,
which yields an R² value close to one.
An additional series of experiments explored the scalability of the stripes approach
along another dimension: the size of the cluster. These experiments were made possible
by Amazon’s EC2 service, which allows users to rapidly provision clusters of varying
sizes for limited durations (for more information, refer back to our discussion of utility
computing in Section 1.1). Virtualized computational units in EC2 are called instances,
and the user is charged only for the instance-hours consumed. Figure 3.11 (left) shows
the running time of the stripes algorithm (on the same corpus, with same setup as
before), on varying cluster sizes, from 20 slave “small” instances all the way up to 80
slave “small” instances (along the x-axis). Running times are shown with solid squares.
Figure 3.11 (right) recasts the same results to illustrate scaling characteristics. The
circles plot the relative size and speedup of the EC2 experiments, with respect to the
20-instance cluster. These results show highly desirable linear scaling characteristics
(i.e., doubling the cluster size makes the job twice as fast). This is confirmed by a linear
regression with an R² value close to one.
Viewed abstractly, the pairs and stripes algorithms represent two different ap-
proaches to counting co-occurring events from a large number of observations. This
[Figure: running time (seconds) versus percentage of the APW corpus for the “pairs” and “stripes” approaches; linear fits with R² = 0.992 and R² = 0.999.]
Figure 3.10: Running time of the “pairs” and “stripes” algorithms for computing word co-
occurrence matrices on different fractions of the APW corpus. These experiments were per-
formed on a Hadoop cluster with 19 slaves, each with two single-core processors and two disks.
[Figure: left, running time (seconds) of the stripes algorithm versus EC2 cluster size (number of slave instances); right, relative speedup versus relative cluster size (1x–4x), with a linear fit yielding R² = 0.997.]
Figure 3.11: Running time of the stripes algorithm on the APW corpus with Hadoop clusters
of different sizes from EC2 (left). Scaling characteristics (relative speedup) in terms of increasing
Hadoop cluster size (right).
general description captures the gist of many algorithms in fields as diverse as text
processing, data mining, and bioinformatics. For this reason, these two design patterns
are broadly useful and frequently observed in a variety of applications.
To conclude, it is worth noting that the pairs and stripes approaches represent
endpoints along a continuum of possibilities. The pairs approach individually records
each co-occurring event, while the stripes approach records all co-occurring events
with respect to a conditioning event. A middle ground might be to record a subset of the co-occurring events with respect to a conditioning event. We might divide up the entire vocabulary into b buckets (e.g., via hashing), so that words co-occurring with w_i would be divided into b smaller “sub-stripes”, associated with b separate keys, (w_i, 1), (w_i, 2) . . . (w_i, b). This would be a reasonable solution to the memory limitations of the stripes approach, since each of the sub-stripes would be smaller. In the case of b = |V|, where |V| is the vocabulary size, this is equivalent to the pairs approach. In the case of b = 1, this is equivalent to the standard stripes approach.
3.3 COMPUTING RELATIVE FREQUENCIES
Let us build on the pairs and stripes algorithms presented in the previous section and
continue with our running example of constructing the word co-occurrence matrix M
for a large corpus. Recall that in this large square n × n matrix, where n = |V| (the vocabulary size), cell m_ij contains the number of times word w_i co-occurs with word w_j within a specific context. The drawback of absolute counts is that they don't take into account the fact that some words appear more frequently than others. Word w_i may co-occur frequently with w_j simply because one of the words is very common. A simple remedy is to convert absolute counts into relative frequencies, f(w_j | w_i). That is, what proportion of the time does w_j appear in the context of w_i? This can be computed using the following equation:

f(w_j | w_i) = N(w_i, w_j) / Σ_w′ N(w_i, w′)    (3.1)
Here, N(·, ·) indicates the number of times a particular co-occurring word pair is ob-
served in the corpus. We need the count of the joint event (word co-occurrence), divided
by what is known as the marginal (the sum of the counts of the conditioning variable
co-occurring with anything else).
Computing relative frequencies with the stripes approach is straightforward. In
the reducer, counts of all words that co-occur with the conditioning variable (w_i in the above example) are available in the associative array. Therefore, it suffices to sum all those counts to arrive at the marginal (i.e., Σ_w′ N(w_i, w′)), and then divide all the joint
counts by the marginal to arrive at the relative frequency for all words. This implemen-
tation requires minimal modification to the original stripes algorithm in Figure 3.9, and
illustrates the use of complex data structures to coordinate distributed computations
in MapReduce. Through appropriate structuring of keys and values, one can use the
MapReduce execution framework to bring together all the pieces of data required to
perform a computation. Note that, as before, this algorithm also assumes that each associative array fits into memory.
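A sketch of the stripes-based relative frequency computation is shown below, reusing the MapWritable representation from the earlier stripes sketch; only the reducer changes, and the choice of DoubleWritable for the output frequencies is ours.

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

public class RelativeFrequencyStripesReducer
    extends Reducer<Text, MapWritable, Text, MapWritable> {

  @Override
  protected void reduce(Text word, Iterable<MapWritable> stripes, Context context)
      throws IOException, InterruptedException {
    // Merge the stripes exactly as in the basic stripes reducer.
    MapWritable merged = new MapWritable();
    for (MapWritable stripe : stripes) {
      for (Map.Entry<Writable, Writable> entry : stripe.entrySet()) {
        Text neighbor = new Text(entry.getKey().toString());
        int add = ((IntWritable) entry.getValue()).get();
        IntWritable current = (IntWritable) merged.get(neighbor);
        merged.put(neighbor, new IntWritable(current == null ? add : current.get() + add));
      }
    }

    // The marginal is simply the sum over all neighbors of N(w, w').
    long marginal = 0;
    for (Writable count : merged.values()) {
      marginal += ((IntWritable) count).get();
    }

    // Divide each joint count by the marginal to obtain f(w' | w).
    MapWritable frequencies = new MapWritable();
    for (Map.Entry<Writable, Writable> entry : merged.entrySet()) {
      double f = ((IntWritable) entry.getValue()).get() / (double) marginal;
      frequencies.put(entry.getKey(), new DoubleWritable(f));
    }
    context.write(word, frequencies);
  }
}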
How might one compute relative frequencies with the pairs approach? In the pairs
approach, the reducer receives (w_i, w_j) as the key and the count as the value. From this alone it is not possible to compute f(w_j | w_i) since we do not have the marginal. Fortunately, as in the mapper, the reducer can preserve state across multiple keys. Inside the reducer, we can buffer in memory all the words that co-occur with w_i and their counts, in essence building the associative array in the stripes approach. To make this work, we must define the sort order of the pair so that keys are first sorted by the left word, and then by the right word. Given this ordering, we can easily detect if all pairs associated with the word we are conditioning on (w_i) have been encountered. At that
point we can go back through the in-memory buffer, compute the relative frequencies,
and then emit those results in the final key-value pairs.
There is one more modification necessary to make this algorithm work. We must
ensure that all pairs with the same left word are sent to the same reducer. This, unfor-
tunately, does not happen automatically: recall that the default partitioner is based on
the hash value of the intermediate key, modulo the number of reducers. For a complex
key, the raw byte representation is used to compute the hash value. As a result, there
is no guarantee that, for example, (dog, aardvark) and (dog, zebra) are assigned to the
same reducer. To produce the desired behavior, we must define a custom partitioner
that only pays attention to the left word. That is, the partitioner should partition based
on the hash of the left word only.
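Assuming the tab-separated Text pair keys used in the earlier pairs sketch, such a partitioner might look like the following; it would be installed on the job via setPartitionerClass.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitions a tab-encoded pair key (left<TAB>right) by the left word only, so that
// all pairs sharing a left word reach the same reducer.
public class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String leftWord = key.toString().split("\t")[0];
    return (leftWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}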
This algorithm will indeed work, but it suffers from the same drawback as the
stripes approach: as the size of the corpus grows, so does the vocabulary size, and at
some point there will not be sufficient memory to store all co-occurring words and their
counts for the word we are conditioning on. For computing the co-occurrence matrix, the
advantage of the pairs approach is that it doesn’t suffer from any memory bottlenecks.
Is there a way to modify the basic pairs approach so that this advantage is retained?
As it turns out, such an algorithm is indeed possible, although it requires the co-
ordination of several mechanisms in MapReduce. The insight lies in properly sequencing
data presented to the reducer. If it were possible to somehow compute (or otherwise
obtain access to) the marginal in the reducer before processing the joint counts, the
reducer could simply divide the joint counts by the marginal to compute the relative
frequencies. The notion of “before” and “after” can be captured in the ordering of
key-value pairs, which can be explicitly controlled by the programmer. That is, the
programmer can define the sort order of keys so that data needed earlier is presented
key              values
(dog, ∗)         [6327, 8514, . . .]   compute marginal: Σ_w′ N(dog, w′) = 42908
(dog, aardvark)  [2,1]                 f(aardvark | dog) = 3/42908
(dog, aardwolf)  [1]                   f(aardwolf | dog) = 1/42908
. . .
(dog, zebra)     [2,1,1,1]             f(zebra | dog) = 5/42908
(doge, ∗)        [682, . . .]          compute marginal: Σ_w′ N(doge, w′) = 1267
. . .
Figure 3.12: Example of the sequence of key-value pairs presented to the reducer in the pairs
algorithm for computing relative frequencies. This illustrates the application of the order inver-
sion design pattern.
to the reducer before data that is needed later. However, we still need to compute the
marginal counts. Recall that in the basic pairs algorithm, each mapper emits a key-
value pair with the co-occurring word pair as the key. To compute relative frequencies,
we modify the mapper so that it additionally emits a “special” key of the form (w_i, ∗),
with a value of one, that represents the contribution of the word pair to the marginal.
Through use of combiners, these partial marginal counts will be aggregated before be-
ing sent to the reducers. Alternatively, the in-mapper combining pattern can be used
to even more efficiently aggregate marginal counts.
In the reducer, we must make sure that the special key-value pairs representing
the partial marginal contributions are processed before the normal key-value pairs rep-
resenting the joint counts. This is accomplished by defining the sort order of the keys
so that pairs with the special symbol of the form (w_i, ∗) are ordered before any other key-value pairs where the left word is w_i. In addition, as before, we must also properly define the partitioner to pay attention to only the left word in each pair. With the
data properly sequenced, the reducer can directly compute the relative frequencies.
A concrete example is shown in Figure 3.12, which lists the sequence of key-value
pairs that a reducer might encounter. First, the reducer is presented with the special key
(dog, ∗) and a number of values, each of which represents a partial marginal contribution
from the map phase (assume here either combiners or in-mapper combining, so the
values represent partially aggregated counts). The reducer accumulates these counts to
arrive at the marginal, Σ_w′ N(dog, w′). The reducer holds on to this value as it processes
subsequent keys. After (dog, ∗), the reducer will encounter a series of keys representing
joint counts; let’s say the first of these is the key (dog, aardvark). Associated with this
key will be a list of values representing partial joint counts from the map phase (two
separate values in this case). Summing these counts will yield the final joint count, i.e.,
the number of times dog and aardvark co-occur in the entire collection. At this point,
since the reducer already knows the marginal, simple arithmetic suffices to compute
the relative frequency. All subsequent joint counts are processed in exactly the same
manner. When the reducer encounters the next special key-value pair (doge, ∗), the
reducer resets its internal state and starts to accumulate the marginal all over again.
Observe that the memory requirement for this algorithm is minimal, since only the
marginal (an integer) needs to be stored. No buffering of individual co-occurring word
counts is necessary, and therefore we have eliminated the scalability bottleneck of the
previous algorithm.
This design pattern, which we call “order inversion”, occurs surprisingly often
and across applications in many domains. It is so named because through proper co-
ordination, we can access the result of a computation in the reducer (for example, an
aggregate statistic) before processing the data needed for that computation. The key
insight is to convert the sequencing of computations into a sorting problem. In most
cases, an algorithm requires data in some fixed order: by controlling how keys are sorted
and how the key space is partitioned, we can present data to the reducer in the order
necessary to perform the proper computations. This greatly cuts down on the amount
of partial results that the reducer needs to hold in memory.
To summarize, the specific application of the order inversion design pattern for
computing relative frequencies requires the following:
• Emitting a special key-value pair for each co-occurring word pair in the mapper
to capture its contribution to the marginal.
• Controlling the sort order of the intermediate key so that the key-value pairs
representing the marginal contributions are processed by the reducer before any
of the pairs representing the joint word co-occurrence counts.
• Defining a custom partitioner to ensure that all pairs with the same left word are
shuffled to the same reducer.
• Preserving state across multiple keys in the reducer to first compute the marginal
based on the special key-value pairs and then dividing the joint counts by the
marginals to arrive at the relative frequencies.
As we will see in Chapter 4, this design pattern is also used in inverted index construc-
tion to properly set compression parameters for postings lists.
3.4 SECONDARY SORTING
MapReduce sorts intermediate key-value pairs by the keys during the shuffle and sort
phase, which is very convenient if computations inside the reducer rely on sort order
(e.g., the order inversion design pattern described in the previous section). However,
what if in addition to sorting by key, we also need to sort by value? Google’s MapReduce
implementation provides built-in functionality for (optional) secondary sorting, which
guarantees that values arrive in sorted order. Hadoop, unfortunately, does not have this
capability built in.
Consider the example of sensor data from a scientific experiment: there are m sensors, each taking readings on a continuous basis, where m is potentially a large number. A dump of the sensor data might look something like the following, where rx after each timestamp represents the actual sensor readings (unimportant for this discussion, but may be a series of values, one or more complex records, or even raw bytes of images).

(t1, m1, r80521)
(t1, m2, r14209)
(t1, m3, r76042)
. . .
(t2, m1, r21823)
(t2, m2, r66508)
(t2, m3, r98347)
Suppose we wish to reconstruct the activity at each individual sensor over time. A
MapReduce program to accomplish this might map over the raw data and emit the
sensor id as the intermediate key, with the rest of each record as the value:
m1 → (t1, r80521)
This would bring all readings from the same sensor together in the reducer. However,
since MapReduce makes no guarantees about the ordering of values associated with the
same key, the sensor readings will not likely be in temporal order. The most obvious
solution is to buffer all the readings in memory and then sort by timestamp before
additional processing. However, it should be apparent by now that any in-memory
buffering of data introduces a potential scalability bottleneck. What if we are working
with a high frequency sensor or sensor readings over a long period of time? What if the
sensor readings themselves are large complex objects? This approach may not scale in
these cases—the reducer would run out of memory trying to buffer all values associated
with the same key.
This is a common problem, since in many applications we wish to first group
together data one way (e.g., by sensor id), and then sort within the groupings another
way (e.g., by time). Fortunately, there is a general purpose solution, which we call the
“value-to-key conversion” design pattern. The basic idea is to move part of the value
into the intermediate key to form a composite key, and let the MapReduce execution
framework handle the sorting. In the above example, instead of emitting the sensor id
as the key, we would emit the sensor id and the timestamp as a composite key:
(m1, t1) → (r80521)
The sensor reading itself now occupies the value. We must define the intermediate key
sort order to first sort by the sensor id (the left element in the pair) and then by the
timestamp (the right element in the pair). We must also implement a custom partitioner
so that all pairs associated with the same sensor are shuffled to the same reducer.
Properly orchestrated, the key-value pairs will be presented to the reducer in the
correct sorted order:
(m1, t1) → [(r80521)]
(m1, t2) → [(r21823)]
(m1, t3) → [(r146925)]
. . .
However, note that sensor readings are now split across multiple keys. The reducer will need to preserve state and keep track of when readings associated with the current sensor end and the next sensor begin. (Alternatively, Hadoop provides API hooks to define “groups” of intermediate keys that should be processed together in the reducer.)
The basic tradeoff between the two approaches discussed above (buffer and in-memory sort vs. value-to-key conversion) is where sorting is performed. One can explicitly implement secondary sorting in the reducer, which is likely to be faster but suffers from a scalability bottleneck. (In principle, this need not be an in-memory sort: it is entirely possible to implement a disk-based sort within the reducer, although one would be duplicating functionality that is already present in the MapReduce execution framework; it makes more sense to take advantage of that functionality with value-to-key conversion.) With value-to-key conversion, sorting is offloaded to the MapReduce execution framework. Note that this approach can be arbitrarily extended to tertiary, quaternary, etc. sorting. This pattern results in many more keys for the framework to sort, but distributed sorting is a task that the MapReduce runtime excels at since it lies at the heart of the programming model.
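The following Python sketch simulates what value-to-key conversion asks of the framework; the toy partitioner (hash standing in for a deterministic custom partitioner) and the in-memory sort stand in for MapReduce's shuffle and sort phase, and the sensor ids and readings are the hypothetical values from the example above.

from collections import defaultdict

# Mapper output after value-to-key conversion: composite key (sensor, timestamp).
mapper_output = [(("m1", 2), "r21823"), (("m2", 1), "r14209"),
                 (("m1", 1), "r80521"), (("m2", 2), "r66508")]

def simulated_shuffle(pairs, num_reducers=2):
    partitions = defaultdict(list)
    for (sensor, t), reading in pairs:
        # Custom partitioner: hash only the sensor id, not the full composite key.
        partitions[hash(sensor) % num_reducers].append(((sensor, t), reading))
    for plist in partitions.values():
        plist.sort()                      # sort by composite key: sensor id, then timestamp
    return partitions

for reducer_id, plist in simulated_shuffle(mapper_output).items():
    for key, reading in plist:
        print(reducer_id, key, reading)   # readings for each sensor arrive in time order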
3.5 RELATIONAL JOINS
One popular application of Hadoop is data-warehousing. In an enterprise setting, a data
warehouse serves as a vast repository of data, holding everything from sales transac-
tions to product inventories. Typically, the data is relational in nature, but increasingly
data warehouses are used to store semi-structured data (e.g., query logs) as well as
unstructured data. Data warehouses form a foundation for business intelligence appli-
cations designed to provide decision support. It is widely believed that insights gained
by mining historical, current, and prospective data can yield competitive advantages in
the marketplace.
Traditionally, data warehouses have been implemented through relational
databases, particularly those optimized for a specific workload known as online analyt-
ical processing (OLAP). A number of vendors offer parallel databases, but customers
find that they often cannot cost-effectively scale to the crushing amounts of data an
organization needs to deal with today. Parallel databases are often quite expensive—
on the order of tens of thousands of dollars per terabyte of user data. Over the past
few years, Hadoop has gained popularity as a platform for data-warehousing. Ham-
merbacher [68], for example, discussed Facebook’s experiences with scaling up business
intelligence applications with Oracle databases, which they ultimately abandoned in
favor of a Hadoop-based solution developed in-house called Hive (which is now an
open-source project). Pig [114] is a platform for massive data analytics built on Hadoop
and capable of handling structured as well as semi-structured data. It was originally
developed by Yahoo, but is now also an open-source project.
Given successful applications of Hadoop to data-warehousing and complex ana-
lytical queries that are prevalent in such an environment, it makes sense to examine
MapReduce algorithms for manipulating relational data. This section focuses specif-
ically on performing relational joins in MapReduce. We should stress here that even
though Hadoop has been applied to process relational data, Hadoop is not a database.
There is an ongoing debate between advocates of parallel databases and proponents
of MapReduce regarding the merits of both approaches for OLAP-type workloads. De-
witt and Stonebraker, two well-known figures in the database community, famously
decried MapReduce as “a major step backwards” in a controversial blog post (http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/). With
colleagues, they ran a series of benchmarks that demonstrated the supposed superiority
of column-oriented parallel databases over Hadoop [120, 144]. However, see Dean and
Ghemawat’s counterarguments [47] and recent attempts at hybrid architectures [1].
We shall refrain here from participating in this lively debate, and instead focus on
discussing algorithms. From an application point of view, it is highly unlikely that an
analyst interacting with a data warehouse will ever be called upon to write MapReduce
programs (and indeed, Hadoop-based systems such as Hive and Pig present a much
higher-level language for interacting with large amounts of data). Nevertheless, it is
instructive to understand the algorithms that underlie basic relational operations.
This section presents three different strategies for performing relational joins on
two datasets (relations), generically named S and T. Let us suppose that relation S
looks something like the following:
(k1, s1, S1)
(k2, s2, S2)
(k3, s3, S3)
. . .
where k is the key we would like to join on, sn is a unique id for the tuple, and the Sn after sn denotes other attributes in the tuple (unimportant for the purposes of the join). Similarly, suppose relation T looks something like this:
(k1, t1, T1)
(k3, t2, T2)
(k8, t3, T3)
. . .
where k is the join key, tn is a unique id for the tuple, and the Tn after tn denotes other attributes in the tuple.
To make this task more concrete, we present one realistic scenario: S might rep-
resent a collection of user profiles, in which case k could be interpreted as the primary
key (i.e., user id). The tuples might contain demographic information such as age, gen-
der, income, etc. The other dataset, T, might represent logs of online activity. Each
tuple might correspond to a page view of a particular URL and may contain additional
information such as time spent on the page, ad revenue generated, etc. The k in these
tuples could be interpreted as the foreign key that associates each individual page view
with a user. Joining these two datasets would allow an analyst, for example, to break
down online activity in terms of demographics.
3.5.1 REDUCE-SIDE JOIN
The first approach to relational joins is what’s known as a reduce-side join. The idea
is quite simple: we map over both datasets and emit the join key as the intermediate
key, and the tuple itself as the intermediate value. Since MapReduce guarantees that
all values with the same key are brought together, all tuples will be grouped by the
join key—which is exactly what we need to perform the join operation. This approach
is known as a parallel sort-merge join in the database community [134]. In more detail,
there are three different cases to consider.
The first and simplest is a one-to-one join, where at most one tuple from S and
one tuple from T share the same join key (but it may be the case that no tuple from
S shares the join key with a tuple from T, or vice versa). In this case, the algorithm
sketched above will work fine. The reducer will be presented keys and lists of values
along the lines of the following:
k23 → [(s64, S64), (t84, T84)]
k37 → [(s68, S68)]
k59 → [(t97, T97), (s81, S81)]
k61 → [(t99, T99)]
. . .
Since we’ve emitted the join key as the intermediate key, we can remove it from the value to save a bit of space (not very important if the intermediate data is compressed). If there are two values associated with a key, then we know that one must be from S and the other must be from T. However, recall that in the
basic MapReduce programming model, no guarantees are made about value ordering,
so the first value might be from S or from T. We can proceed to join the two tuples
and perform additional computations (e.g., filter by some other attribute, compute
aggregates, etc.). If there is only one value associated with a key, this means that no
tuple in the other dataset shares the join key, so the reducer does nothing.
Let us now consider the one-to-many join. Assume that tuples in S have unique
join keys (i.e., k is the primary key in S), so that S is the “one” and T is the “many”.
The above algorithm will still work, but when processing each key in the reducer, we
have no idea when the value corresponding to the tuple from S will be encountered, since
values are arbitrarily ordered. The easiest solution is to buffer all values in memory,
pick out the tuple from S, and then cross it with every tuple from T to perform the
join. However, as we have seen several times already, this creates a scalability bottleneck
since we may not have sufficient memory to hold all the tuples with the same join key.
This is a problem that requires a secondary sort, and the solution lies in the
value-to-key conversion design pattern we just presented. In the mapper, instead of
simply emitting the join key as the intermediate key, we instead create a composite key
consisting of the join key and the tuple id (from either S or T). Two additional changes
are required: First, we must define the sort order of the keys to first sort by the join
key, and then sort all tuple ids from S before all tuple ids from T. Second, we must
define the partitioner to pay attention to only the join key, so that all composite keys
with the same join key arrive at the same reducer.
After applying the value-to-key conversion design pattern, the reducer will be
presented with keys and values along the lines of the following:
(k82, s105) → [(S105)]
(k82, t98) → [(T98)]
(k82, t101) → [(T101)]
(k82, t137) → [(T137)]
. . .
Since both the join key and the tuple id are present in the intermediate key, we can remove them from the value to save a bit of space (once again, not very important if the intermediate data is compressed). Whenever the reducer encounters
a new join key, it is guaranteed that the associated value will be the relevant tuple from
S. The reducer can hold this tuple in memory and then proceed to cross it with tuples
from T in subsequent steps (until a new join key is encountered). Since the MapReduce
execution framework performs the sorting, there is no need to buffer tuples (other than
the single one from S). Thus, we have eliminated the scalability bottleneck.
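A minimal Python sketch of this reducer logic follows; the composite keys are assumed to arrive already sorted (the S tuple first for each join key), and tagging tuples by an id prefix of 's' or 't' is an illustrative convention rather than part of the algorithm.

def one_to_many_join(sorted_pairs):
    current_key, s_attrs = None, None
    for (join_key, tuple_id), attrs in sorted_pairs:
        if tuple_id.startswith('s'):          # the single tuple from S for this join key
            current_key, s_attrs = join_key, attrs
        elif join_key == current_key:         # a tuple from T: cross it with the held S tuple
            yield join_key, s_attrs, attrs

pairs = [(('k82', 's105'), 'S105'),
         (('k82', 't98'),  'T98'),
         (('k82', 't101'), 'T101'),
         (('k82', 't137'), 'T137')]
print(list(one_to_many_join(pairs)))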
Finally, let us consider the many-to-many join case. Assuming that S is the smaller
dataset, the above algorithm works as well. Consider what happens at the reducer:
(k82, s105) → [(S105)]
(k82, s124) → [(S124)]
. . .
(k82, t98) → [(T98)]
(k82, t101) → [(T101)]
(k82, t137) → [(T137)]
. . .
All the tuples from S with the same join key will be encountered first, which the reducer
can buffer in memory. As the reducer processes each tuple from T, it is crossed with all
the tuples from S. Of course, we are assuming that the tuples from S (with the same
join key) will fit into memory, which is a limitation of this algorithm (and why we want
to control the sort order so that the smaller dataset comes first).
The basic idea behind the reduce-side join is to repartition the two datasets by
the join key. The approach isn’t particularly efficient since it requires shuffling both
datasets across the network. This leads us to the map-side join.
3.5.2 MAP-SIDE JOIN
Suppose we have two datasets that are both sorted by the join key. We can perform a
join by scanning through both datasets simultaneously—this is known as a merge join
in the database community. We can parallelize this by partitioning and sorting both
datasets in the same way. For example, suppose S and T were both divided into ten
files, partitioned in the same manner by the join key. Further suppose that in each file,
the tuples were sorted by the join key. In this case, we simply need to merge join the
first file of S with the first file of T, the second file of S with the second file of T, etc. This can be accomplished in parallel, in the map phase of a MapReduce job—hence, a map-side join. In practice, we map over one of the datasets (the larger one) and inside the mapper read the corresponding part of the other dataset to perform the merge join (note that this almost always implies a non-local read). No reducer is required, unless the programmer wishes to repartition the output
or perform further processing.
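For illustration, here is a sequential merge join over two key-sorted sequences in Python, the operation each mapper would perform on corresponding partitions of S and T; the sketch assumes join keys are unique in S, as in the one-to-many case.

def merge_join(s_records, t_records):
    s_iter, t_iter = iter(s_records), iter(t_records)
    s, t = next(s_iter, None), next(t_iter, None)
    while s is not None and t is not None:
        if s[0] < t[0]:
            s = next(s_iter, None)
        elif s[0] > t[0]:
            t = next(t_iter, None)
        else:
            yield s[0], s[1], t[1]
            t = next(t_iter, None)        # keep the S tuple, advance T (one-to-many)

S = [('k1', 'S1'), ('k3', 'S3'), ('k8', 'S8')]
T = [('k1', 'T1'), ('k3', 'T2'), ('k3', 'T4')]
print(list(merge_join(S, T)))             # -> [('k1','S1','T1'), ('k3','S3','T2'), ('k3','S3','T4')]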
A map-side join is far more efficient than a reduce-side join since there is no need
to shuffle the datasets over the network. But is it realistic to expect that the stringent
conditions required for map-side joins are satisfied? In many cases, yes. The reason
is that relational joins happen within the broader context of a workflow, which may
include multiple steps. Therefore, the datasets that are to be joined may be the output
of previous processes (either MapReduce jobs or other code). If the workflow is known
in advance and relatively static (both reasonable assumptions in a mature workflow),
we can engineer the previous processes to generate output sorted and partitioned in
a way that makes efficient map-side joins possible (in MapReduce, by using a custom
partitioner and controlling the sort order of key-value pairs). For ad hoc data analysis,
reduce-side joins are a more general, albeit less efficient, solution. Consider the case
where datasets have multiple keys that one might wish to join on—then no matter
how the data is organized, map-side joins will require repartitioning of the data. Al-
ternatively, it is always possible to repartition a dataset using an identity mapper and
reducer. But of course, this incurs the cost of shuffling data over the network.
There is a final restriction to bear in mind when using map-side joins with the
Hadoop implementation of MapReduce. We assume here that the datasets to be joined
were produced by previous MapReduce jobs, so this restriction applies to keys the
reducers in those jobs may emit. Hadoop permits reducers to emit keys that are different
from the input key whose values they are processing (that is, input and output keys need not be the same, nor even the same type; in contrast, recall from Section 2.2 that in Google’s implementation, reducers’ output keys must be exactly the same as their input keys). However, if the output key of a
reducer is different from the input key, then the output dataset from the reducer will
not necessarily be partitioned in a manner consistent with the specified partitioner
(because the partitioner applies to the input keys rather than the output keys). Since
map-side joins depend on consistent partitioning and sorting of keys, the reducers used
to generate data that will participate in a later map-side join must not emit any key
but the one they are currently processing.
3.5.3 MEMORY-BACKED JOIN
In addition to the two previous approaches to joining relational data that leverage the
MapReduce framework to bring together tuples that share a common join key, there is a
family of approaches we call memory-backed joins based on random access probes. The
simplest version is applicable when one of the two datasets completely fits in memory
on each node. In this situation, we can load the smaller dataset into memory in every
mapper, populating an associative array to facilitate random access to tuples based on
the join key. The mapper initialization API hook (see Section 3.1.1) can be used for
this purpose. Mappers are then applied to the other (larger) dataset, and for each input
key-value pair, the mapper probes the in-memory dataset to see if there is a tuple with
the same join key. If there is, the join is performed. This is known as a simple hash join
by the database community [51].
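A simple hash join can be sketched in a few lines of Python; loading the table corresponds to the mapper initialization step, and probing corresponds to processing each input key-value pair (unique join keys in S are assumed here for brevity).

def load_smaller_dataset(s_records):
    return {join_key: attrs for join_key, attrs in s_records}   # in-memory associative array

def map_and_probe(table, t_records):
    for join_key, attrs in t_records:
        if join_key in table:                # random-access probe on the join key
            yield join_key, table[join_key], attrs

S = [('k1', 'S1'), ('k3', 'S3')]
T = [('k1', 'T1'), ('k2', 'T2'), ('k3', 'T3')]
print(list(map_and_probe(load_smaller_dataset(S), T)))   # k2 has no match and is dropped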
What if neither dataset fits in memory? The simplest solution is to divide the smaller dataset, let’s say S, into n partitions, such that S = S1 ∪ S2 ∪ . . . ∪ Sn. We can choose n so that each partition is small enough to fit in memory, and then run n memory-backed hash joins. This, of course, requires streaming through the other dataset n times.
There is an alternative approach to memory-backed joins for cases where neither
datasets fit into memory. A distributed key-value store can be used to hold one dataset
in memory across multiple machines while mapping over the other. The mappers would
then query this distributed key-value store in parallel and perform joins if the join keys match. (In order to achieve good performance in accessing distributed key-value stores, it is often necessary to batch queries before making synchronous requests, to amortize latency over many requests, or to rely on asynchronous requests.) The open-source caching system memcached can be used for exactly
this purpose, and therefore we’ve dubbed this approach memcached join. This approach is described in more detail in a technical report [95].
3.6 SUMMARY
This chapter provides a guide on the design of MapReduce algorithms. In particular,
we present a number of “design patterns” that capture effective solutions to common
problems. In summary, they are:
• “In-mapper combining”, where the functionality of the combiner is moved into the
mapper. Instead of emitting intermediate output for every input key-value pair,
the mapper aggregates partial results across multiple input records and only emits
intermediate key-value pairs after some amount of local aggregation is performed.
• The related patterns “pairs” and “stripes” for keeping track of joint events from
a large number of observations. In the pairs approach, we keep track of each joint
event separately, whereas in the stripes approach we keep track of all events that
co-occur with the same event. Although the stripes approach is significantly more
efficient, it requires memory on the order of the size of the event space, which
presents a scalability bottleneck.
• “Order inversion”, where the main idea is to convert the sequencing of compu-
tations into a sorting problem. Through careful orchestration, we can send the
reducer the result of a computation (e.g., an aggregate statistic) before it encoun-
ters the data necessary to produce that computation.
• “Value-to-key conversion”, which provides a scalable solution for secondary sort-
ing. By moving part of the value into the key, we can exploit the MapReduce
execution framework itself for sorting.
Ultimately, controlling synchronization in the MapReduce programming model boils
down to effective use of the following techniques:
1. Constructing complex keys and values that bring together data necessary for a
computation. This is used in all of the above design patterns.
2. Executing user-specified initialization and termination code in either the mapper
or reducer. For example, in-mapper combining depends on emission of intermediate
key-value pairs in the map task termination code.
3. Preserving state across multiple inputs in the mapper and reducer. This is used
in in-mapper combining, order inversion, and value-to-key conversion.
4. Controlling the sort order of intermediate keys. This is used in order inversion and
value-to-key conversion.
5. Controlling the partitioning of the intermediate key space. This is used in order
inversion and value-to-key conversion.
This concludes our overview of MapReduce algorithm design. It should be clear by now
that although the programming model forces one to express algorithms in terms of a
small set of rigidly-defined components, there are many tools at one’s disposal to shape
the flow of computation. In the next few chapters, we will focus on specific classes
of MapReduce algorithms: for inverted indexing in Chapter 4, for graph processing in
Chapter 5, and for expectation-maximization in Chapter 6.
CHAPTER 4
Inverted Indexing for Text Retrieval
Web search is the quintessential large-data problem. Given an information need ex-
pressed as a short query consisting of a few terms, the system’s task is to retrieve
relevant web objects (web pages, PDF documents, PowerPoint slides, etc.) and present
them to the user. How large is the web? It is difficult to compute exactly, but even a
conservative estimate would place the size at several tens of billions of pages, totaling
hundreds of terabytes (considering text alone). In real-world applications, users demand
results quickly from a search engine—query latencies longer than a few hundred mil-
liseconds will try a user’s patience. Fulfilling these requirements is quite an engineering
feat, considering the amounts of data involved!
Nearly all retrieval engines for full-text search today rely on a data structure
called an inverted index, which given a term provides access to the list of documents
that contain the term. In information retrieval parlance, objects to be retrieved are
generically called “documents” even though in actuality they may be web pages, PDFs,
or even fragments of code. Given a user query, the retrieval engine uses the inverted
index to score documents that contain the query terms with respect to some ranking
model, taking into account features such as term matches, term proximity, attributes
of the terms in the document (e.g., bold, appears in title, etc.), as well as the hyperlink
structure of the documents (e.g., PageRank [117], which we’ll discuss in Chapter 5, or
related metrics such as HITS [84] and SALSA [88]).
The web search problem decomposes into three components: gathering web con-
tent (crawling), construction of the inverted index (indexing) and ranking documents
given a query (retrieval). Crawling and indexing share similar characteristics and re-
quirements, but these are very different from retrieval. Gathering web content and
building inverted indexes are for the most part offline problems. Both need to be scal-
able and efficient, but they do not need to operate in real time. Indexing is usually a
batch process that runs periodically: the frequency of refreshes and updates is usually
dependent on the design of the crawler. Some sites (e.g., news organizations) update
their content quite frequently and need to be visited often; other sites (e.g., government
regulations) are relatively static. However, even for rapidly changing sites, it is usually
tolerable to have a delay of a few minutes until content is searchable. Furthermore, since
the amount of content that changes rapidly is relatively small, running smaller-scale index updates at greater frequencies is usually an adequate solution (leaving aside the problem of searching live data streams such as tweets, which requires different techniques and algorithms). Retrieval, on the
other hand, is an online problem that demands sub-second response time. Individual
users expect low query latencies, but query throughput is equally important since a
retrieval engine must usually serve many users concurrently. Furthermore, query loads
are highly variable, depending on the time of day, and can exhibit “spikey” behavior
due to special circumstances (e.g., a breaking news event triggers a large number of
searches on the same topic). On the other hand, resource consumption for the indexing
problem is more predictable.
A comprehensive treatment of web search is beyond the scope of this chapter,
and even this entire book. Explicitly recognizing this, we mostly focus on the problem
of inverted indexing, the task most amenable to solutions in MapReduce. This chapter
begins by first providing an overview of web crawling (Section 4.1) and introducing the
basic structure of an inverted index (Section 4.2). A baseline inverted indexing algorithm
in MapReduce is presented in Section 4.3. We point out a scalability bottleneck in that
algorithm, which leads to a revised version presented in Section 4.4. Index compression
is discussed in Section 4.5, which fills in missing details on building compact index
structures. Since MapReduce is primarily designed for batch-oriented processing, it
does not provide an adequate solution for the retrieval problem, an issue we discuss in
Section 4.6. The chapter concludes with a summary and pointers to additional readings.
4.1 WEB CRAWLING
Before building inverted indexes, we must first acquire the document collection over
which these indexes are to be built. In academia and for research purposes, this can
be relatively straightforward. Standard collections for information retrieval research are
widely available for a variety of genres ranging from blogs to newswire text. For re-
searchers who wish to explore web-scale retrieval, there is the ClueWeb09 collection
that contains one billion web pages in ten languages (totaling 25 terabytes) crawled by
Carnegie Mellon University in early 2009 (http://boston.lti.cs.cmu.edu/Data/clueweb09/). Obtaining access to these standard collections is usually as simple as signing an appropriate data license from the distributor of the collection, paying a reasonable fee, and arranging for receipt of the data. (As an interesting side note, in the 1990s, research collections were distributed via postal mail on CD-ROMs, and later, on DVDs. Electronic distribution became common earlier this decade for collections below a certain size. However, many collections today are so large that the only practical method of distribution is shipping hard drives via postal mail.)
For real-world web search, however, one cannot simply assume that the collection
is already available. Acquiring web content requires crawling, which is the process of
traversing the web by repeatedly following hyperlinks and storing downloaded pages
for subsequent processing. Conceptually, the process is quite simple to understand: we
start by populating a queue with a “seed” list of pages. The crawler downloads pages
in the queue, extracts links from those pages to add to the queue, stores the pages for
further processing, and repeats. In fact, rudimentary web crawlers can be written in a
few hundred lines of code.
However, effective and efficient web crawling is far more complex. The following
lists a number of issues that real-world crawlers must contend with:
• A web crawler must practice good “etiquette” and not overload web servers. For
example, it is common practice to wait a fixed amount of time before repeated
requests to the same server. In order to respect these constraints while maintaining
good throughput, a crawler typically keeps many execution threads running in
parallel and maintains many TCP connections (perhaps hundreds) open at the
same time.
• Since a crawler has finite bandwidth and resources, it must prioritize the order in
which unvisited pages are downloaded. Such decisions must be made online and
in an adversarial environment, in the sense that spammers actively create “link
farms” and “spider traps” full of spam pages to trick a crawler into overrepresent-
ing content from a particular site.
• Most real-world web crawlers are distributed systems that run on clusters of ma-
chines, often geographically distributed. To avoid downloading a page multiple
times and to ensure data consistency, the crawler as a whole needs mechanisms
for coordination and load-balancing. It also needs to be robust with respect to
machine failures, network outages, and errors of various types.
• Web content changes, but with different frequency depending on both the site and
the nature of the content. A web crawler needs to learn these update patterns
to ensure that content is reasonably current. Getting the right recrawl frequency
is tricky: too frequent means wasted resources, but not frequent enough leads to
stale content.
• The web is full of duplicate content. Examples include multiple copies of a popu-
lar conference paper, mirrors of frequently-accessed sites such as Wikipedia, and
newswire content that is often duplicated. The problem is compounded by the fact
that most repetitious pages are not exact duplicates but near duplicates (that is,
basically the same page but with different ads, navigation bars, etc.) It is desir-
able during the crawling process to identify near duplicates and select the best
exemplar to index.
• The web is multilingual. There is no guarantee that pages in one language only
link to pages in the same language. For example, a professor in Asia may maintain
her website in the local language, but the site may contain links to publications in English.
Furthermore, many pages contain a mix of text in different languages. Since doc-
ument processing techniques (e.g., tokenization, stemming) differ by language, it
is important to identify the (dominant) language on a page.
The above discussion is not meant to be an exhaustive enumeration of issues, but rather
to give the reader an appreciation of the complexities involved in this intuitively simple
task. For more information, see a recent survey on web crawling [113]. Section 4.7
provides pointers to additional readings.
4.2 INVERTED INDEXES
In its basic form, an inverted index consists of postings lists, one associated with each
term that appears in the collection. (In information retrieval parlance, term is preferred over word since documents are processed, e.g., by tokenization and stemming, into basic units that are often not words in the linguistic sense.) The structure of an inverted index is illustrated in
Figure 4.1. A postings list is comprised of individual postings, each of which consists of
a document id and a payload—information about occurrences of the term in the doc-
ument. The simplest payload is. . . nothing! For simple boolean retrieval, no additional
information is needed in the posting other than the document id; the existence of the
posting itself indicates that presence of the term in the document. The most common
payload, however, is term frequency (tf), or the number of times the term occurs in the
document. More complex payloads include positions of every occurrence of the term in
the document (to support phrase queries and document scoring based on term proxim-
ity), properties of the term (such as if it occurred in the page title or not, to support
document ranking based on notions of importance), or even the results of additional
linguistic processing (for example, indicating that the term is part of a place name, to
support address searches). In the web context, anchor text information (text associated
with hyperlinks from other pages to the page in question) is useful in enriching the
representation of document content (e.g., [107]); this information is often stored in the
index as well.
terms     postings
term1  →  (d1, p), (d5, p), (d6, p), (d11, p), . . .
term2  →  (d11, p), (d23, p), (d59, p), (d84, p), . . .
term3  →  (d1, p), (d4, p), (d11, p), (d19, p), . . .

Figure 4.1: Simple illustration of an inverted index. Each term is associated with a list of postings. Each posting is comprised of a document id and a payload, denoted by p in this case. An inverted index provides quick access to document ids that contain a term.

In the example shown in Figure 4.1, we see that term1 occurs in {d1, d5, d6, d11, . . .}, term2 occurs in {d11, d23, d59, d84, . . .}, and term3 occurs in {d1, d4, d11, d19, . . .}. In an
actual implementation, we assume that documents can be identified by a unique integer
ranging from 1 to n, where n is the total number of documents. (It is preferable to start numbering the documents at one since it is not possible to code zero with many common compression schemes used in information retrieval; see Section 4.5.) Generally, postings are
sorted by document id, although other sort orders are possible as well. The document ids
have no inherent semantic meaning, although assignment of numeric ids to documents
need not be arbitrary. For example, pages from the same domain may be consecutively
numbered. Or, alternatively, pages that are higher in quality (based, for example, on
PageRank values) might be assigned smaller numeric values so that they appear toward
the front of a postings list. Either way, an auxiliary data structure is necessary to
maintain the mapping from integer document ids to some other more meaningful handle,
such as a URL.
Given a query, retrieval involves fetching postings lists associated with query terms
and traversing the postings to compute the result set. In the simplest case, boolean
retrieval involves set operations (union for boolean OR and intersection for boolean
AND) on postings lists, which can be accomplished very efficiently since the postings
are sorted by document id. In the general case, however, query–document scores must be
computed. Partial document scores are stored in structures called accumulators. At the
end (i.e., once all postings have been processed), the top k documents are then extracted
to yield a ranked list of results for the user. Of course, there are many optimization
strategies for query evaluation (both approximate and exact) that reduce the number
of postings a retrieval engine must examine.
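As a small illustration of why the document-id sort order matters, the following Python sketch intersects two postings lists with a linear merge, which is all a boolean AND query requires (postings here are bare document ids, using the lists from Figure 4.1).

def intersect(p1, p2):
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1                      # advance the list with the smaller document id
        else:
            j += 1
    return result

print(intersect([1, 5, 6, 11], [11, 23, 59, 84]))   # documents containing both terms -> [11]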
The size of an inverted index varies, depending on the payload stored in each
posting. If only term frequency is stored, a well-optimized inverted index can be a tenth
of the size of the original document collection. An inverted index that stores positional
information would easily be several times larger than one that does not. Generally, it
is possible to hold the entire vocabulary (i.e., dictionary of all the terms) in memory,
especially with techniques such as front-coding [156]. However, with the exception of
well-resourced, commercial web search engines (Google, for example, keeps its indexes in memory), postings lists are usually too large to
store in memory and must be held on disk, usually in compressed form (more details in
Section 4.5). Query evaluation, therefore, necessarily involves random disk access and
“decoding” of the postings. One important aspect of the retrieval problem is to organize
disk operations such that random seeks are minimized.
1: class Mapper
2: procedure Map(docid n, doc d)
3:     H ← new AssociativeArray
4:     for all term t ∈ doc d do
5:         H{t} ← H{t} + 1
6:     for all term t ∈ H do
7:         Emit(term t, posting ⟨n, H{t}⟩)

1: class Reducer
2: procedure Reduce(term t, postings [⟨n1, f1⟩, ⟨n2, f2⟩ . . .])
3:     P ← new List
4:     for all posting ⟨a, f⟩ ∈ postings [⟨n1, f1⟩, ⟨n2, f2⟩ . . .] do
5:         Append(P, ⟨a, f⟩)
6:     Sort(P)
7:     Emit(term t, postings P)
Figure 4.2: Pseudo-code of the baseline inverted indexing algorithm in MapReduce. Map-
pers emit postings keyed by terms, the execution framework groups postings by term, and the
reducers write postings lists to disk.
Once again, this brief discussion glosses over many complexities and does a huge
injustice to the tremendous amount of research in information retrieval. However, our
goal is to provide the reader with an overview of the important issues; Section 4.7
provides references to additional readings.
4.3 INVERTED INDEXING: BASELINE IMPLEMENTATION
MapReduce was designed from the very beginning to produce the various data struc-
tures involved in web search, including inverted indexes and the web graph. We begin
with the basic inverted indexing algorithm shown in Figure 4.2.
Input to the mapper consists of document ids (keys) paired with the actual con-
tent (values). Individual documents are processed in parallel by the mappers. First,
each document is analyzed and broken down into its component terms. The process-
ing pipeline differs depending on the application and type of document, but for web
pages typically involves stripping out HTML tags and other elements such as JavaScript
code, tokenizing, case folding, removing stopwords (common words such as ‘the’, ‘a’,
‘of’, etc.), and stemming (removing affixes from words so that ‘dogs’ becomes ‘dog’).
Once the document has been analyzed, term frequencies are computed by iterating over
all the terms and keeping track of counts. Lines 4 and 5 in the pseudo-code reflect the
process of computing term frequencies, but hide the details of document processing.
After this histogram has been built, the mapper then iterates over all terms. For each
term, a pair consisting of the document id and the term frequency is created. Each pair,
denoted by ⟨n, H{t}⟩ in the pseudo-code, represents an individual posting. The mapper
then emits an intermediate key-value pair with the term as the key and the posting as
the value, in line 7 of the mapper pseudo-code. Although as presented here only the
term frequency is stored in the posting, this algorithm can be easily augmented to store
additional information (e.g., term positions) in the payload.
In the shuffle and sort phase, the MapReduce runtime essentially performs a large,
distributed group by of the postings by term. Without any additional effort by the
programmer, the execution framework brings together all the postings that belong in
the same postings list. This tremendously simplifies the task of the reducer, which
simply needs to gather together all the postings and write them to disk. The reducer
begins by initializing an empty list and then appends all postings associated with the
same key (term) to the list. The postings are then sorted by document id, and the entire
postings list is emitted as a value, with the term as the key. Typically, the postings list
is first compressed, but we leave this aside for now (see Section 4.4 for more details).
The final key-value pairs are written to disk and comprise the inverted index. Since
each reducer writes its output in a separate file in the distributed file system, our final
index will be split across r files, where r is the number of reducers. There is no need to
further consolidate these files. Separately, we must also build an index to the postings
lists themselves for the retrieval engine: this is typically in the form of mappings from
term to (file, byte offset) pairs, so that given a term, the retrieval engine can fetch
its postings list by opening the appropriate file and seeking to the correct byte offset
position in that file.
Execution of the complete algorithm is illustrated in Figure 4.3 with a toy example
consisting of three documents, three mappers, and two reducers. Intermediate key-value
pairs (from the mappers) and the final key-value pairs comprising the inverted index
(from the reducers) are shown in the boxes with dotted lines. Postings are shown as
pairs of boxes, with the document id on the left and the term frequency on the right.
The MapReduce programming model provides a very concise expression of the in-
verted indexing algorithm. Its implementation is similarly concise: the basic algorithm
can be implemented in as few as a couple dozen lines of code in Hadoop (with mini-
mal document processing). Such an implementation can be completed as a week-long
programming assignment in a course for advanced undergraduates or first-year gradu-
ate students [83, 93]. In a non-MapReduce indexer, a significant fraction of the code
is devoted to grouping postings by term, given constraints imposed by memory and
disk (e.g., memory capacity is limited, disk seeks are slow, etc.). In MapReduce, the
programmer does not need to worry about any of these issues—most of the heavy lifting
is performed by the execution framework.
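The following single-process Python sketch mimics the data flow of Figure 4.2 on the toy collection used in Figure 4.3; document processing is reduced to whitespace tokenization, and an in-memory group-by stands in for the shuffle and sort phase, so this is an illustration of the algorithm rather than a Hadoop program.

from collections import Counter, defaultdict

def map_document(docid, text):
    for term, tf in Counter(text.split()).items():    # term frequencies for one document
        yield term, (docid, tf)

docs = {1: "one fish two fish", 2: "red fish blue fish", 3: "one red bird"}

groups = defaultdict(list)                             # "shuffle and sort": group postings by term
for docid, text in docs.items():
    for term, posting in map_document(docid, text):
        groups[term].append(posting)

index = {term: sorted(postings) for term, postings in groups.items()}   # reducer: sort by docid
print(index["fish"])                                   # -> [(1, 2), (2, 2)]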
Figure 4.3: Simple illustration of the baseline inverted indexing algorithm in MapReduce with three mappers and two reducers, applied to a toy collection of three documents (d1 = “one fish, two fish”, d2 = “red fish, blue fish”, d3 = “one red bird”). Postings are shown as pairs of boxes (docid, tf).
4.4 INVERTED INDEXING: REVISED IMPLEMENTATION
The inverted indexing algorithm presented in the previous section serves as a reasonable
baseline. However, there is a significant scalability bottleneck: the algorithm assumes
that there is sufficient memory to hold all postings associated with the same term. Since
the basic MapReduce execution framework makes no guarantees about the ordering of
values associated with the same key, the reducer first buffers all postings (line 5 of the
reducer pseudo-code in Figure 4.2) and then performs an in-memory sort before writing
the postings to disk (see the similar discussion in Section 3.4: in principle, this need not be an in-memory sort, as it is entirely possible to implement a disk-based sort within the reducer).
For efficient retrieval, postings need to be sorted by document id.
However, as collections become larger, postings lists grow longer, and at some point in
time, reducers will run out of memory.
There is a simple solution to this problem. Since the execution framework guaran-
tees that keys arrive at each reducer in sorted order, one way to overcome the scalability
bottleneck is to let the MapReduce runtime do the sorting for us. Instead of emitting
key-value pairs of the following type:
(term t, posting ⟨docid, f⟩)
We emit intermediate key-value pairs of the type instead:
(tuple ⟨t, docid⟩, tf f)
In other words, the key is a tuple containing the term and the document id, while the
value is the term frequency. This is exactly the value-to-key conversion design pattern
introduced in Section 3.4. With this modification, the programming model ensures that
the postings arrive in the correct order. This, combined with the fact that reducers can
hold state across multiple keys, allows postings lists to be created with minimal memory
usage. As a detail, remember that we must define a custom partitioner to ensure that
all tuples with the same term are shuffled to the same reducer.
The revised MapReduce inverted indexing algorithm is shown in Figure 4.4. The
mapper remains unchanged for the most part, other than differences in the intermediate
key-value pairs. The Reduce method is called for each key (i.e., ⟨t, n⟩), and by design,
there will only be one value associated with each key. For each key-value pair, a posting
can be directly added to the postings list. Since the postings are guaranteed to arrive
in sorted order by document id, they can be incrementally coded in compressed form—
thus ensuring a small memory footprint. Finally, when all postings associated with the
same term have been processed (i.e., t ≠ tprev), the entire postings list is emitted. The
final postings list must be written out in the Close method. As with the baseline
algorithm, payloads can be easily changed: by simply replacing the intermediate value
f (term frequency) with whatever else is desired (e.g., term positional information).

1: class Mapper
2: method Map(docid n, doc d)
3:     H ← new AssociativeArray
4:     for all term t ∈ doc d do
5:         H{t} ← H{t} + 1
6:     for all term t ∈ H do
7:         Emit(tuple ⟨t, n⟩, tf H{t})

1: class Reducer
2: method Initialize
3:     tprev ← ∅
4:     P ← new PostingsList
5: method Reduce(tuple ⟨t, n⟩, tf [f])
6:     if t ≠ tprev ∧ tprev ≠ ∅ then
7:         Emit(term tprev, postings P)
8:         P.Reset()
9:     P.Add(⟨n, f⟩)
10:    tprev ← t
11: method Close
12:    Emit(term tprev, postings P)

Figure 4.4: Pseudo-code of a scalable inverted indexing algorithm in MapReduce. By applying the value-to-key conversion design pattern, the execution framework is exploited to sort postings so that they arrive sorted by document id in the reducer.
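The reducer logic of Figure 4.4 can be simulated in a few lines of Python; the (term, docid) keys are assumed to arrive already sorted, as the execution framework guarantees once the custom sort order and partitioner are in place, and the names below are illustrative.

def build_postings(sorted_pairs):
    prev_term, postings = None, []
    for (term, docid), tf in sorted_pairs:
        if term != prev_term and prev_term is not None:
            yield prev_term, postings          # the term changed: emit the finished list
            postings = []
        postings.append((docid, tf))           # postings arrive sorted by document id
        prev_term = term
    if prev_term is not None:
        yield prev_term, postings              # the Close step emits the final list

pairs = [(('blue', 2), 1), (('fish', 1), 2), (('fish', 2), 2), (('one', 1), 1), (('one', 3), 1)]
print(dict(build_postings(pairs)))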
There is one more detail we must address when building inverted indexes. Since
almost all retrieval models take into account document length when computing query–
document scores, this information must also be extracted. Although it is straightforward
to express this computation as another MapReduce job, this task can actually be folded
into the inverted indexing process. When processing the terms in each document, the
document length is known, and can be written out as “side data” directly to HDFS.
We can take advantage of the ability for a mapper to hold state across the processing of
multiple documents in the following manner: an in-memory associative array is created
to store document lengths, which is populated as each document is processed (in general, there is no worry about insufficient memory to hold these data). When
the mapper finishes processing input records, document lengths are written out to
HDFS (i.e., in the Close method). This approach is essentially a variant of the in-
mapper combining pattern. Document length data ends up in m different files, where
m is the number of mappers; these files are then consolidated into a more compact
representation. Alternatively, document length information can be emitted in special
key-value pairs by the mapper. One must then write a custom partitioner so that these
special key-value pairs are shuffled to a single reducer, which will be responsible for
writing out the length data separate from the postings lists.
4.5 INDEX COMPRESSION
We return to the question of how postings are actually compressed and stored on disk.
This chapter devotes a substantial amount of space to this topic because index com-
pression is one of the main differences between a “toy” indexer and one that works on
real-world collections. Otherwise, MapReduce inverted indexing algorithms are pretty
straightforward.
Let us consider the canonical case where each posting consists of a document id
and the term frequency. A naïve implementation might represent the first as a 32-bit integer (however, note that 2^32 − 1 is “only” 4,294,967,295, which is much less than even the most conservative estimate of the size of the web) and the second as a 16-bit integer. Thus, a postings list might be encoded as
follows:
[(5, 2), (7, 3), (12, 1), (49, 1), (51, 2), . . .]
where each posting is represented by a pair in parentheses. Note that all brackets, paren-
theses, and commas are only included to enhance readability; in reality the postings
would be represented as a long stream of integers. This naïve implementation would
require six bytes per posting. Using this scheme, the entire inverted index would be
about as large as the collection itself. Fortunately, we can do significantly better.
The first trick is to encode differences between document ids as opposed to the
document ids themselves. Since the postings are sorted by document ids, the differences
(called d-gaps) must be positive integers greater than zero. The above postings list,
represented with d-gaps, would be:
[(5, 2), (2, 3), (5, 1), (37, 1), (2, 2), . . .]
Of course, we must actually encode the first document id. We haven’t lost any infor-
mation, since the original document ids can be easily reconstructed from the d-gaps.
However, it’s not obvious that we’ve reduced the space requirements either, since the
largest possible d-gap is one less than the number of documents in the collection.
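The conversion to and from d-gaps is a one-line scan in each direction; the sketch below uses the document ids from the running example.

def to_gaps(doc_ids):
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)         # the first "gap" is the document id itself
        prev = d
    return gaps

def from_gaps(gaps):
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

print(to_gaps([5, 7, 12, 49, 51]))    # -> [5, 2, 5, 37, 2]
print(from_gaps([5, 2, 5, 37, 2]))    # -> [5, 7, 12, 49, 51]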
This is where the second trick comes in, which is to represent the d-gaps in a
way such that it takes less space for smaller numbers. Similarly, we want to apply the
same techniques to compress the term frequencies, since for the most part they are also
small values. But to understand how this is done, we need to take a slight detour into
compression techniques, particularly for coding integers.
Compression, in general, can be characterized as either lossless or lossy: it’s fairly
obvious that lossless compression is required in this context. To start, it is important
to understand that all compression techniques represent a time–space tradeoff. That
is, we reduce the amount of space on disk necessary to store data, but at the cost of
extra processor cycles that must be spent coding and decoding data. Therefore, it is
possible that compression reduces size but also slows processing. However, if the two
factors are properly balanced (i.e., decoding speed can keep up with disk bandwidth),
we can achieve the best of both worlds: smaller and faster.
4.5.1 BYTE-ALIGNED AND WORD-ALIGNED CODES
In most programming languages, an integer is encoded in four bytes and holds a value
between 0 and 2^32 − 1, inclusive. We limit our discussion to unsigned integers, since d-
gaps are always positive (and greater than zero). This means that 1 and 4,294,967,295
both occupy four bytes. Obviously, encoding d-gaps this way doesn’t yield any reduc-
tions in size.
A simple approach to compression is to only use as many bytes as is necessary to
represent the integer. This is known as variable-length integer coding (varInt for short)
and accomplished by using the high order bit of every byte as the continuation bit,
which is set to one in the last byte and zero elsewhere. As a result, we have 7 bits per
byte for coding the value, which means that 0 ≤ n < 2^7 can be expressed with 1 byte, 2^7 ≤ n < 2^14 with 2 bytes, 2^14 ≤ n < 2^21 with 3, and 2^21 ≤ n < 2^28 with 4 bytes. This
scheme can be extended to code arbitrarily-large integers (i.e., beyond 4 bytes). As a
concrete example, the two numbers:
127, 128
would be coded as such:
1 1111111, 0 0000001 1 0000000
The above code contains two code words, the first consisting of 1 byte, and the second
consisting of 2 bytes. Of course, the comma and the spaces are there only for readability.
Variable-length integers are byte-aligned because the code words always fall along byte
boundaries. As a result, there is never any ambiguity about where one code word ends
and the next begins. However, the downside of varInt coding is that decoding involves
lots of bit operations (masks, shifts). Furthermore, the continuation bit sometimes re-
sults in frequent branch mispredicts (depending on the actual distribution of d-gaps),
which slows down processing.
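A minimal varInt coder in Python, following the convention in the text (continuation bit set to one in the last byte, high-order 7-bit group first); production coders differ in byte order and loop structure, so treat this purely as an illustration.

def varint_encode(n):
    groups = []
    while True:
        groups.append(n & 0x7F)       # take 7 bits at a time
        n >>= 7
        if n == 0:
            break
    groups.reverse()                  # most significant group first
    encoded = bytearray(groups)
    encoded[-1] |= 0x80               # continuation bit marks the last byte
    return bytes(encoded)

def varint_decode(data):
    values, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:                  # last byte of this integer
            values.append(n)
            n = 0
    return values

print([format(b, '08b') for b in varint_encode(127)])          # -> ['11111111']
print([format(b, '08b') for b in varint_encode(128)])          # -> ['00000001', '10000000']
print(varint_decode(varint_encode(127) + varint_encode(128)))  # -> [127, 128]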
A variant of the varInt scheme was described by Jeff Dean in a keynote talk at
the WSDM 2009 conference (http://research.google.com/people/jeff/WSDM09-keynote.pdf).
The insight is to code groups of four integers at a time.
Each group begins with a prefix byte, divided into four 2-bit values that specify the
byte length of each of the following integers. For example, the following prefix byte:
00,00,01,10
indicates that the following four integers are one byte, one byte, two bytes, and three
bytes, respectively. Therefore, each group of four integers would consume anywhere be-
tween 5 and 17 bytes. A simple lookup table based on the prefix byte directs the decoder
on how to process subsequent bytes to recover the coded integers. The advantage of this
group varInt coding scheme is that values can be decoded with fewer branch mispredicts
and bitwise operations. Experiments reported by Dean suggest that decoding integers
with this scheme is more than twice as fast as the basic varInt scheme.
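The scheme can be sketched as follows in Python; the byte order within each value and the requirement that the input length be a multiple of four are simplifications made here for illustration.

def group_varint_encode(nums):
    assert len(nums) % 4 == 0, "pad the last group in a real encoder"
    out = bytearray()
    for i in range(0, len(nums), 4):
        group = nums[i:i + 4]
        lengths = [max(1, (n.bit_length() + 7) // 8) for n in group]
        prefix = 0
        for length in lengths:
            prefix = (prefix << 2) | (length - 1)   # four 2-bit byte lengths
        out.append(prefix)
        for n, length in zip(group, lengths):
            out += n.to_bytes(length, 'big')
    return bytes(out)

def group_varint_decode(data):
    nums, pos = [], 0
    while pos < len(data):
        prefix = data[pos]; pos += 1
        for shift in (6, 4, 2, 0):                  # read the four lengths from the prefix byte
            length = ((prefix >> shift) & 0b11) + 1
            nums.append(int.from_bytes(data[pos:pos + length], 'big'))
            pos += length
    return nums

values = [5, 2, 300, 70000]        # byte lengths 1, 1, 2, 3 give the prefix byte 00,00,01,10
print(group_varint_decode(group_varint_encode(values)) == values)   # -> True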
In most architectures, accessing entire machine words is more efficient than fetch-
ing all its bytes separately. Therefore, it makes sense to store postings in increments
of 16-bit, 32-bit, or 64-bit machine words. Anh and Moffat [8] presented several word-
aligned coding methods, one of which is called Simple-9, based on 32-bit words. In this
coding scheme, four bits in each 32-bit word are reserved as a selector. The remaining
28 bits are used to code actual integer values. Now, there are a variety of ways these 28
bits can be divided to code one or more integers: 28 bits can be used to code one 28-bit
integer, two 14-bit integers, three 9-bit integers (with one bit unused), etc., all the way
up to twenty-eight 1-bit integers. In fact, there are nine different ways the 28 bits can be
divided into equal parts (hence the name of the technique), some with leftover unused
bits. Which of these nine ways is used is stored in the selector bits. Therefore, decoding involves reading a 32-bit
word, examining the selector to see how the remaining 28 bits are packed, and then
appropriately decoding each integer. Coding works in the opposite way: the algorithm
scans ahead to see how many integers can be squeezed into 28 bits, packs those integers,
and sets the selector bits appropriately.
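A decoder for this scheme might be sketched as follows. The particular selector numbering (0
through 8) and the choice to pack integers starting from the low-order end of the 28 data bits are
assumptions of this sketch rather than details taken from Anh and Moffat’s paper.

    public class Simple9 {

      // For each selector value: how many integers the word holds and how wide each one is.
      private static final int[] COUNT = {28, 14, 9, 7, 5, 4, 3,  2,  1};
      private static final int[] WIDTH = { 1,  2, 3, 4, 5, 7, 9, 14, 28};

      // Decodes one 32-bit word into out; returns how many integers it held.
      public static int decodeWord(int word, int[] out) {
        int selector = word >>> 28;          // top 4 bits select the packing
        int count = COUNT[selector];
        int width = WIDTH[selector];
        int mask = (1 << width) - 1;
        for (int i = 0; i < count; i++) {
          out[i] = (word >>> (i * width)) & mask;
        }
        return count;
      }

      public static void main(String[] args) {
        // Selector 7: two 14-bit integers, here 300 and 5.
        int word = (7 << 28) | (5 << 14) | 300;
        int[] out = new int[28];
        int n = decodeWord(word, out);
        System.out.println(n + " values: " + out[0] + ", " + out[1]);  // 2 values: 300, 5
      }
    }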
4.5.2 BIT-ALIGNED CODES
The advantage of byte-aligned and word-aligned codes is that they can be coded and
decoded quickly. The downside, however, is that they must consume multiples of eight
bits, even when fewer bits might suffice (the Simple-9 scheme gets around this by
packing multiple integers into a 32-bit word, but even then, bits are often wasted).
In bit-aligned codes, on the other hand, code words can occupy any number of bits,
meaning that boundaries can fall anywhere. In practice, coding and decoding bit-aligned
codes require processing bytes and appropriately shifting or masking bits (usually more
involved than varInt and group varInt coding).
One additional challenge with bit-aligned codes is that we need a mechanism to
delimit code words, i.e., tell where the last ends and the next begins, since there are
no byte boundaries to guide us. To address this issue, most bit-aligned codes are so-
called prefix codes (confusingly, they are also called prefix-free codes), in which no valid
code word is a prefix of any other valid code word. For example, coding 0 ≤ x < 3 with
{0, 1, 01} is not a valid prefix code, since 0 is a prefix of 01, and so we can’t tell if 01 is
two code words or one. On the other hand, {00, 01, 1} is a valid prefix code, such that
a sequence of bits:
0001101001010100
can be unambiguously segmented into:
00 01 1 01 00 1 01 01 00
and decoded without any additional delimiters.
One of the simplest prefix codes is the unary code. An integer x > 0 is coded as x − 1
one bits followed by a zero bit. Note that unary codes do not allow the representation
of zero, which is fine since d-gaps and term frequencies should never be zero.¹¹ As an
example, 4 in unary code is 1110. With unary code we can code x in x bits, which
although economical for small values, becomes inefficient for even moderately large
values. Unary codes are rarely used by themselves, but form a component of other
coding schemes. Unary codes of the first ten positive integers are shown in Figure 4.5.

 x   unary        γ          Golomb, b = 5   Golomb, b = 10
 1   0            0          0:00            0:000
 2   10           10:0       0:01            0:001
 3   110          10:1       0:10            0:010
 4   1110         110:00     0:110           0:011
 5   11110        110:01     0:111           0:100
 6   111110       110:10     10:00           0:101
 7   1111110      110:11     10:01           0:1100
 8   11111110     1110:000   10:10           0:1101
 9   111111110    1110:001   10:110          0:1110
10   1111111110   1110:010   10:111          0:1111

Figure 4.5: The first ten positive integers in unary, γ, and Golomb (b = 5, 10) codes.
Elias γ code is an efficient coding scheme that is widely used in practice. An integer
x > 0 is broken into two components, 1 + ⌊log₂ x⌋ (= n, the length), which is coded in
unary code, and x − 2^⌊log₂ x⌋ (= r, the remainder), which is in binary.¹² The unary
component n specifies the number of bits required to code x, and the binary component
codes the remainder r in n − 1 bits. As an example, consider x = 10: 1 + ⌊log₂ 10⌋ = 4,
which is 1110. The binary component codes x − 2^3 = 2 in 4 − 1 = 3 bits, which is
010. Putting both together, we arrive at 1110:010. The extra colon is inserted only for
readability; it’s not part of the final code, of course.
Working in reverse, it is easy to unambiguously decode a bit stream of γ codes:
First, we read a unary code c_u, which is a prefix code. This tells us that the binary
portion is written in c_u − 1 bits, which we then read as c_b. We can then reconstruct x
as 2^(c_u − 1) + c_b. For x < 16, γ codes occupy less than a full byte, which makes them more
compact than variable-length integer codes. Since term frequencies for the most part are
relatively small, γ codes make sense for them and can yield substantial space savings.
For reference, the γ codes of the first ten positive integers are shown in Figure 4.5.
¹¹ As a note, some sources describe slightly different formulations of the same coding scheme. Here, we adopt the
conventions used in the classic IR text Managing Gigabytes [156].
¹² Note that ⌊x⌋ is the floor function, which maps x to the largest integer not greater than x, so, e.g., ⌊3.8⌋ = 3.
This is the default behavior in many programming languages when casting from a floating-point type to an
integer type.
A variation on γ code is δ code, where the n portion of the γ code is coded in γ code
itself (as opposed to unary code). For smaller values γ codes are more compact, but for
larger values, δ codes take less space.
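The following toy sketch makes the structure of unary and γ codes explicit by building code words
as strings of zeros and ones; a real implementation would pack bits into a buffer instead. All names
are illustrative.

    public class EliasGamma {

      // Unary code for x > 0: (x - 1) one bits followed by a zero bit.
      static String unary(int x) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < x - 1; i++) sb.append('1');
        return sb.append('0').toString();
      }

      // Gamma code for x > 0: length n = 1 + floor(log2 x) in unary, then the remainder in n - 1 bits.
      static String gamma(int x) {
        int n = 32 - Integer.numberOfLeadingZeros(x);      // equals 1 + floor(log2 x)
        int r = x - (1 << (n - 1));                        // x - 2^floor(log2 x)
        StringBuilder sb = new StringBuilder(unary(n));
        for (int i = n - 2; i >= 0; i--) sb.append((r >>> i) & 1);
        return sb.toString();
      }

      // Decodes a single gamma code starting at pos[0] and advances pos past it.
      static int decodeGamma(String bits, int[] pos) {
        int n = 1;
        while (bits.charAt(pos[0]++) == '1') n++;          // read the unary length component
        int r = 0;
        for (int i = 0; i < n - 1; i++) r = (r << 1) | (bits.charAt(pos[0]++) - '0');
        return (1 << (n - 1)) + r;                         // reconstruct x as 2^(n-1) + r
      }

      public static void main(String[] args) {
        System.out.println(gamma(10));                              // 1110010, i.e., 1110:010
        System.out.println(decodeGamma(gamma(10), new int[] {0}));  // 10
      }
    }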
Unary and γ codes are parameterless, but even better compression can be achieved
with parameterized codes. A good example of this is Golomb code. For some parameter
b, an integer x > 0 is coded in two parts: first, we compute q = ⌊(x − 1)/b⌋ and code
q + 1 in unary; then, we code the remainder r = x − qb − 1 in truncated binary. This
is accomplished as follows: if b is a power of two, then truncated binary is exactly the
same as normal binary, requiring log₂ b bits. Otherwise, we code the first 2^(⌊log₂ b⌋+1) − b
values of r in ⌊log₂ b⌋ bits and code the rest of the values of r by coding r + 2^(⌊log₂ b⌋+1) − b
in ordinary binary representation using ⌊log₂ b⌋ + 1 bits. In this case, the remainder r is coded in
either ⌊log₂ b⌋ or ⌊log₂ b⌋ + 1 bits, and unlike ordinary binary coding, truncated binary
codes are prefix codes. As an example, if b = 5, then r can take the values {0, 1, 2, 3, 4},
which would be coded with the following code words: {00, 01, 10, 110, 111}. For reference,
Golomb codes of the first ten positive integers are shown in Figure 4.5 for b = 5 and
b = 10. A special case of Golomb code is worth noting: if b is a power of two, then
coding and decoding can be handled more efficiently (needing only bit shifts and bit
masks, as opposed to multiplication and division). These are known as Rice codes.
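Continuing in the same toy style, the sketch below codes an integer x with parameter b, writing
the quotient in unary and the remainder in truncated binary exactly as described above; running
it reproduces the Golomb columns of Figure 4.5 (the colon is again there only for readability).

    public class GolombCode {

      static String unary(int x) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < x - 1; i++) sb.append('1');
        return sb.append('0').toString();
      }

      static String binary(int value, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = width - 1; i >= 0; i--) sb.append((value >>> i) & 1);
        return sb.toString();
      }

      static String golomb(int x, int b) {
        int q = (x - 1) / b;                            // quotient, coded as q + 1 in unary
        int r = x - q * b - 1;                          // remainder, 0 <= r < b
        int k = 31 - Integer.numberOfLeadingZeros(b);   // floor(log2 b)
        int cutoff = (1 << (k + 1)) - b;                // the first 2^(k+1) - b remainders get k bits
        String remainder = (r < cutoff)
            ? binary(r, k)                              // short code words
            : binary(r + cutoff, k + 1);                // long code words
        return unary(q + 1) + ":" + remainder;
      }

      public static void main(String[] args) {
        for (int x = 1; x <= 10; x++) {
          System.out.println(x + "  b=5: " + golomb(x, 5) + "  b=10: " + golomb(x, 10));
        }
        // When b is a power of two the remainder is plain binary, i.e., a Rice code.
      }
    }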
Researchers have shown that Golomb compression works well for d-gaps, and is
optimal with the following parameter setting:
b ≈ 0.69 · N/df    (4.1)

where df is the document frequency of the term, and N is the number of documents in
the collection.¹³
¹³ For details as to why this is the case, we refer the reader elsewhere [156], but here’s the intuition: under
reasonable assumptions, the appearance of postings can be modeled as a sequence of independent Bernoulli
trials, which implies a certain distribution of d-gaps. From this we can derive an optimal setting of b.
Putting everything together, one popular approach for postings compression is to
represent d-gaps with Golomb codes and term frequencies with γ codes [156, 162]. If
positional information is desired, we can use the same trick to code differences between
term positions using γ codes.
4.5.3 POSTINGS COMPRESSION
Having completed our slight detour into integer compression techniques, we can now
return to the scalable inverted indexing algorithm shown in Figure 4.4 and discuss how
postings lists can be properly compressed. As we can see from the previous section,
there is a wide range of choices that represent different tradeoffs between compression
ratio and decoding speed. Actual performance also depends on characteristics of the
collection, which, among other factors, determine the distribution of d-gaps. Büttcher
et al. [30] recently compared the performance of various compression techniques on
coding document ids. In terms of the amount of compression that can be obtained
(measured in bits per docid), Golomb and Rice codes performed the best, followed by
γ codes, Simple-9, varInt, and group varInt (the least space efficient). In terms of raw
decoding speed, the order was almost the reverse: group varInt was the fastest, followed
by varInt.¹⁴ Simple-9 was substantially slower, and the bit-aligned codes were even
slower than that. Within the bit-aligned codes, Rice codes were the fastest, followed by
γ, with Golomb codes being the slowest (about ten times slower than group varInt).
¹⁴ However, this study found less speed difference between group varInt and basic varInt than Dean’s analysis,
presumably due to the different distribution of d-gaps in the collections they were examining.
Let us discuss what modifications are necessary to our inverted indexing algorithm
if we were to adopt Golomb compression for d-gaps and represent term frequencies
with γ codes. Note that this represents a space-efficient encoding, at the cost of slower
decoding compared to alternatives. Whether or not this is actually a worthwhile tradeoff
in practice is not important here: use of Golomb codes serves a pedagogical purpose, to
illustrate how one might set compression parameters.
Coding term frequencies with γ codes is easy since they are parameterless. Com-
pressing d-gaps with Golomb codes, however, is a bit tricky, since two parameters are
required: the size of the document collection and the number of postings for a particular
postings list (i.e., the document frequency, or df). The first is easy to obtain and can be
passed into the reducer as a constant. The df of a term, however, is not known until all
the postings have been processed—and unfortunately, the parameter must be known
before any posting is coded. At first glance, this seems like a chicken-and-egg problem.
A two-pass solution that involves first buffering the postings (in memory) would suffer
from the memory bottleneck we’ve been trying to avoid in the first place.
To get around this problem, we need to somehow inform the reducer of a term’s
df before any of its postings arrive. This can be solved with the order inversion design
pattern introduced in Section 3.3 to compute relative frequencies. The solution is to
have the mapper emit special keys of the form ⟨t, ∗⟩ to communicate partial document
frequencies. That is, inside the mapper, in addition to emitting intermediate key-value
pairs of the following form:
(tuple ⟨t, docid⟩, tf f)
we also emit special intermediate key-value pairs like this:
(tuple ⟨t, ∗⟩, df e)
to keep track of document frequencies associated with each term. In practice, we can
accomplish this by applying the in-mapper combining design pattern (see Section 3.1).
The mapper holds an in-memory associative array that keeps track of how many doc-
uments a term has been observed in (i.e., the local document frequency of the term for
the subset of documents processed by the mapper). Once the mapper has processed all
input records, special keys of the form ⟨t, ∗⟩ are emitted with the partial df as the value.
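A sketch of what this looks like on the mapper side is given below. It assumes a hypothetical
TermDocid pair writable (not shown) whose natural ordering places docid −1, standing in for the
special symbol ∗, before all real docids; it also assumes the input key carries the document id, and
the tokenization is deliberately simplistic.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class IndexMapper extends Mapper<LongWritable, Text, TermDocid, IntWritable> {

      // In-mapper combining state: partial document frequency for each term.
      private final Map<String, Integer> partialDf = new HashMap<>();

      @Override
      protected void map(LongWritable docid, Text doc, Context context)
          throws IOException, InterruptedException {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc.toString().split("\\s+")) {
          if (!term.isEmpty()) tf.merge(term, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
          // Emit (<t, docid>, tf) for the postings themselves.
          context.write(new TermDocid(e.getKey(), (int) docid.get()), new IntWritable(e.getValue()));
          // This document contributes 1 to the term's partial document frequency.
          partialDf.merge(e.getKey(), 1, Integer::sum);
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit (<t, *>, partial df) once all of this mapper's input has been processed.
        for (Map.Entry<String, Integer> e : partialDf.entrySet()) {
          context.write(new TermDocid(e.getKey(), -1), new IntWritable(e.getValue()));
        }
      }
    }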
To ensure that these special keys arrive first, we define the sort order of the
tuple so that the special symbol ∗ precedes all documents (part of the order inversion
design pattern). Thus, for each term, the reducer will first encounter the ⟨t, ∗⟩ key,
associated with a list of values representing partial df values originating from each
mapper. Summing all these partial contributions will yield the term’s df, which can
then be used to set the Golomb compression parameter b. This allows the postings to be
incrementally compressed as they are encountered in the reducer—memory bottlenecks
are eliminated since we do not need to buffer postings in memory.
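On the reducer side, the logic for the ⟨t, ∗⟩ key reduces to a few lines; the fragment below
(illustrative names, not the book’s reference implementation) sums the partial dfs and applies
Equation 4.1 to obtain b before any real posting for that term is processed.

    import org.apache.hadoop.io.IntWritable;

    public class GolombParameter {

      // Sums the partial dfs emitted by the mappers and applies b ~ 0.69 * N / df.
      public static int computeB(Iterable<IntWritable> partialDfs, long collectionSize) {
        long df = 0;
        for (IntWritable partial : partialDfs) {
          df += partial.get();
        }
        return Math.max(1, (int) (0.69 * collectionSize / df));
      }
    }

With b in hand, the subsequent ⟨t, docid⟩ keys for the same term can be converted to d-gaps and
Golomb-coded one at a time as they arrive.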
Once again, the order inversion design pattern comes to the rescue. Recall that
the pattern is useful when a reducer needs to access the result of a computation (e.g.,
an aggregate statistic) before it encounters the data necessary to produce that compu-
tation. For computing relative frequencies, that bit of information was the marginal. In
this case, it’s the document frequency.
4.6 WHAT ABOUT RETRIEVAL?
Thus far, we have briefly discussed web crawling and focused mostly on MapReduce
algorithms for inverted indexing. What about retrieval? It should be fairly obvious
that MapReduce, which was designed for large batch operations, is a poor solution for
retrieval. Since users demand sub-second response times, every aspect of retrieval must
be optimized for low latency, which is exactly the opposite tradeoff made in MapReduce.
Recall the basic retrieval problem: we must look up postings lists corresponding to query
terms, systematically traverse those postings lists to compute query–document scores,
and then return the top k results to the user. Looking up postings implies random disk
seeks, since for the most part postings are too large to fit into memory (leaving aside
caching and other special cases for now). Unfortunately, random access is not a forte
of the distributed file system underlying MapReduce—such operations require multiple
round-trip network exchanges (and associated latencies). In HDFS, a client must first
obtain the location of the desired data block from the namenode before the appropriate
datanode can be contacted for the actual data. Of course, access will typically require
a random disk seek on the datanode itself.
It should be fairly obvious that serving the search needs of a large number of
users, each of whom demands sub-second response times, is beyond the capabilities of
any single machine. The only solution is to distribute retrieval across a large number
of machines, which necessitates breaking up the index in some manner. There are two
main partitioning strategies for distributed retrieval: document partitioning and term
partitioning. Under document partitioning, the entire collection is broken up into mul-
tiple smaller sub-collections, each of which is assigned to a server. In other words, each
server holds the complete index for a subset of the entire collection. This corresponds
to partitioning vertically in Figure 4.6. With term partitioning, on the other hand,
each server is responsible for a subset of the terms for the entire collection. That is, a
server holds the postings for all documents in the collection for a subset of terms. This
corresponds to partitioning horizontally in Figure 4.6.

Figure 4.6: Term–document matrix for a toy collection (nine documents, nine terms) illus-
trating different partitioning strategies: partitioning vertically (1, 2, 3) corresponds to document
partitioning, whereas partitioning horizontally (a, b, c) corresponds to term partitioning.
Document and term partitioning require different retrieval strategies and represent
different tradeoffs. Retrieval under document partitioning involves a query broker, which
forwards the user’s query to all partition servers, merges partial results from each, and
then returns the final results to the user. With this architecture, searching the entire
collection requires that the query be processed by every partition server. However,
since each partition operates independently and traverses postings in parallel, document
partitioning typically yields shorter query latencies (compared to a single monolithic
index with much longer postings lists).
Retrieval under term partitioning, on the other hand, requires a very different
strategy. Suppose the user’s query Q contains three terms, q1, q2, and q3. Under the
pipelined query evaluation strategy, the broker begins by forwarding the query to the
server that holds the postings for q1 (usually the least frequent term). The server tra-
verses the appropriate postings list and computes partial query–document scores, stored
in the accumulators. The accumulators are then passed to the server that holds the post-
ings associated with q2 for additional processing, and then to the server for q3, before
final results are passed back to the broker and returned to the user. Although this
query evaluation strategy may not substantially reduce the latency of any particular
query, it can theoretically increase a system’s throughput due to the far smaller number
of total disk seeks required for each user query (compared to document partitioning).
However, load-balancing is tricky in a pipelined term-partitioned architecture due to
skew in the distribution of query terms, which can create “hot spots” on servers that
hold the postings for frequently-occurring query terms.
In general, studies have shown that document partitioning is a better strategy
overall [109], and this is the strategy adopted by Google [16]. Furthermore, it is known
that Google maintains its indexes in memory (although this is certainly not the common
case for search engines in general). One key advantage of document partitioning is
that result quality degrades gracefully with machine failures. Partition servers that are
offline will simply fail to deliver results for their subsets of the collection. With sufficient
partitions, users might not even be aware that documents are missing. For most queries,
the web contains more relevant documents than any user has time to digest: users of
course care about getting relevant documents (sometimes, they are happy with a single
relevant document), but they are generally less discriminating when it comes to which
relevant documents appear in their results (out of the set of all relevant documents).
Note that partitions may be unavailable due to reasons other than machine failure:
cycling through different partitions is a very simple and non-disruptive strategy for
index updates.
Working in a document-partitioned architecture, there are a variety of approaches
to dividing up the web into smaller pieces. Proper partitioning of the collection can
address one major weakness of this architecture, which is that every partition server
is involved in every user query. Along one dimension, it is desirable to partition by
document quality using one or more classifiers; see [124] for a recent survey on web
page classification. Partitioning by document quality supports a multi-phase search
strategy: the system examines partitions containing high quality documents first, and
only backs off to partitions containing lower quality documents if necessary. This reduces
the number of servers that need to be contacted for a user query. Along an orthogonal
dimension, it is desirable to partition documents by content (perhaps also guided by
the distribution of user queries from logs), so that each partition is “well separated”
from the others in terms of topical coverage. This also reduces the number of machines
that need to be involved in serving a user’s query: the broker can direct queries only to
the partitions that are likely to contain relevant documents, as opposed to forwarding
the user query to all the partitions.
On a large scale, reliability of service is provided by replication, both in terms
of multiple machines serving the same partition within a single datacenter and in terms of
replication across geographically-distributed datacenters. This creates at least two query
routing problems: since it makes sense to serve clients from the closest datacenter, a
service must route queries to the appropriate location. Within a single datacenter, the
system needs to properly balance load across replicas.
There are two final components of real-world search engines that are worth dis-
cussing. First, recall that postings only store document ids. Therefore, raw retrieval
results consist of a ranked list of semantically meaningless document ids. It is typically
the responsibility of document servers, functionally distinct from the partition servers
holding the indexes, to generate meaningful output for user presentation. Abstractly, a
document server takes as input a query and a document id, and computes an appropri-
ate result entry, typically comprising the title and URL of the page, a snippet of the
source document showing the user’s query terms in context, and additional metadata
about the document. Second, query evaluation can benefit immensely from caching, of
individual postings (assuming that the index is not already in memory) and even results
of entire queries [13]. This is made possible by the Zipfian distribution of queries, with
very frequent queries at the head of the distribution dominating the total number of
queries. Search engines take advantage of this with cache servers, which are functionally
distinct from all of the components discussed above.
4.7 SUMMARY AND ADDITIONAL READINGS
Web search is a complex problem that breaks down into three conceptually-distinct
components. First, the document collection must be gathered (by crawling the web).
Next, inverted indexes and other auxiliary data structures must be built from the docu-
ments. Both of these can be considered offline problems. Finally, index structures must
be accessed and processed in response to user queries to generate search results. This
last task is an online problem that demands both low latency and high throughput.
This chapter primarily focused on building inverted indexes, the problem most
suitable for MapReduce. After all, inverted indexing is nothing but a very large dis-
tributed sort and group by operation! We began with a baseline implementation of an
inverted indexing algorithm, but quickly noticed a scalability bottleneck that stemmed
from having to buffer postings in memory. Application of the value-to-key conversion
design pattern (Section 3.4) addressed the issue by offloading the task of sorting post-
ings by document id to the MapReduce execution framework. We also surveyed various
techniques for integer compression, which yield postings lists that are both more com-
pact and faster to process. As a specific example, one could use Golomb codes for
compressing d-gaps and γ codes for term frequencies. We showed how the order inver-
sion design pattern introduced in Section 3.3 for computing relative frequencies can be
used to properly set compression parameters.
Additional Readings. Our brief discussion of web search glosses over many com-
plexities and does a huge injustice to the tremendous amount of research in information
retrieval. Here, however, we provide a few entry points into the literature. A survey ar-
ticle by Zobel and Moffat [162] is an excellent starting point on indexing and retrieval
algorithms. Another by Baeza-Yates et al. [11] overviews many important issues in
distributed retrieval. A keynote talk at the WSDM 2009 conference by Jeff Dean re-
vealed a lot of information about the evolution of the Google search architecture.¹⁵
Finally, a number of general information retrieval textbooks have been recently pub-
lished [101, 42, 30]. Of these three, the one by Büttcher et al. [30] is noteworthy in
having detailed experimental evaluations that compare the performance (both effec-
tiveness and efficiency) of a wide range of algorithms and techniques. While outdated
in many other respects, the textbook Managing Gigabytes [156] remains an excellent
source for index compression techniques. Finally, ACM SIGIR is an annual conference
and the most prestigious venue for academic information retrieval research; proceedings
from those events are perhaps the best starting point for those wishing to keep abreast
of publicly-documented developments in the field.
¹⁵ http://research.google.com/people/jeff/WSDM09-keynote.pdf
CHAPTER 5
Graph Algorithms
Graphs are ubiquitous in modern society: examples encountered by almost everyone
on a daily basis include the hyperlink structure of the web (simply known as the web
graph), social networks (manifest in the flow of email, phone call patterns, connections
on social networking sites, etc.), and transportation networks (roads, bus routes, flights,
etc.). Our very own existence is dependent on an intricate metabolic and regulatory
network, which can be characterized as a large, complex graph involving interactions
between genes, proteins, and other cellular products. This chapter focuses on graph
algorithms in MapReduce. Although most of the content has nothing to do with text
processing per se, documents frequently exist in the context of some underlying network,
making graph analysis an important component of many text processing applications.
Perhaps the best known example is PageRank, a measure of web page quality based
on the structure of hyperlinks, which is used in ranking results for web search. As one
of the first applications of MapReduce, PageRank exemplifies a large class of graph
algorithms that can be concisely captured in the programming model. We will discuss
PageRank in detail later this chapter.
In general, graphs can be characterized by nodes (or vertices) and links (or edges)
that connect pairs of nodes.¹ These connections can be directed or undirected. In some
graphs, there may be an edge from a node to itself, resulting in a self loop; in others,
such edges are disallowed. We assume that both nodes and links may be annotated with
additional metadata: as a simple example, in a social network where nodes represent
individuals, there might be demographic information (e.g., age, gender, location) at-
tached to the nodes and type information attached to the links (e.g., indicating type of
relationship such as “friend” or “spouse”).
¹ Throughout this chapter, we use node interchangeably with vertex and similarly with link and edge.
Mathematicians have always been fascinated with graphs, dating back to Euler’s
paper on the Seven Bridges of Königsberg in 1736. Over the past few centuries, graphs
have been extensively studied, and today much is known about their properties. Far
more than theoretical curiosities, theorems and algorithms on graphs can be applied to
solve many real-world problems:
• Graph search and path planning. Search algorithms on graphs are invoked millions
of times a day, whenever anyone searches for directions on the web. Similar algo-
rithms are also involved in friend recommendations and expert-finding in social
networks. Path planning problems involving everything from network packets to
delivery trucks represent another large class of graph search problems.
• Graph clustering. Can a large graph be divided into components that are relatively
disjoint (for example, as measured by inter-component links [59])? Among other
applications, this task is useful for identifying communities in social networks (of
interest to sociologists who wish to understand how human relationships form and
evolve) and for partitioning large graphs (of interest to computer scientists who
seek to better parallelize graph processing). See [158] for a survey.
• Minimum spanning trees. A minimum spanning tree for a graph G with weighted
edges is a tree that contains all vertices of the graph and a subset of edges that
minimizes the sum of edge weights. A real-world example of this problem is a
telecommunications company that wishes to lay optical fiber to span a number
of destinations at the lowest possible cost (where weights denote costs). This ap-
proach has also been applied to a wide variety of problems, including social networks
and the migration of Polynesian islanders [64].
• Bipartite graph matching. A bipartite graph is one whose vertices can be divided
into two disjoint sets. Matching problems on such graphs can be used to model
job seekers looking for employment or singles looking for dates.
• Maximum flow. In a weighted directed graph with two special nodes called the
source and the sink, the max flow problem involves computing the amount of
“traffic” that can be sent from source to sink given various flow capacities defined
by edge weights. Transportation companies (airlines, shipping, etc.) and network
operators grapple with complex versions of these problems on a daily basis.
• Identifying “special” nodes. There are many ways to define what special means,
including metrics based on node in-degree, average distance to other nodes, and
relationship to cluster structure. These special nodes are important to investigators
attempting to break up terrorist cells, epidemiologists modeling the spread of
diseases, advertisers trying to promote products, and many others.
A common feature of these problems is the scale of the datasets on which the algorithms
must operate: for example, the hyperlink structure of the web, which contains billions
of pages, or social networks that contain hundreds of millions of individuals. Clearly,
algorithms that run on a single machine and depend on the entire graph residing in
memory are not scalable. We’d like to put MapReduce to work on these challenges.²
² As a side note, Google recently published a short description of a system called Pregel [98], based on Valiant’s
Bulk Synchronous Parallel model [148], for large-scale graph algorithms; a longer description is anticipated in
a forthcoming paper [99].
This chapter is organized as follows: we begin in Section 5.1 with an introduction
to graph representations, and then explore two classic graph algorithms in MapReduce:
parallel breadth-first search (Section 5.2) and PageRank (Section 5.3). Before concluding
with a summary and pointing out additional readings, Section 5.4 discusses a number
of general issues that affect graph processing with MapReduce.

Figure 5.1: A simple directed graph (left) represented as an adjacency matrix (middle) and
with adjacency lists (right).
5.1 GRAPH REPRESENTATIONS
One common way to represent graphs is with an adjacency matrix. A graph with n nodes
can be represented as an n × n square matrix M, where a value in cell m_ij indicates an
edge from node n_i to node n_j. In the case of graphs with weighted edges, the matrix cells
contain edge weights; otherwise, each cell contains either a one (indicating an edge),
or a zero (indicating none). With undirected graphs, only half the matrix is used (e.g.,
cells above the diagonal). For graphs that allow self loops (a directed edge from a node
to itself), the diagonal might be populated; otherwise, the diagonal remains empty.
Figure 5.1 provides an example of a simple directed graph (left) and its adjacency
matrix representation (middle).
Although mathematicians prefer the adjacency matrix representation of graphs
for easy manipulation with linear algebra, such a representation is far from ideal for
computer scientists concerned with efficient algorithmic implementations. Most of the
applications discussed in the chapter introduction involve sparse graphs, where the
number of actual edges is far smaller than the number of possible edges.³ For example,
in a social network of n individuals, there are n(n −1) possible “friendships” (where n
may be on the order of hundreds of millions). However, even the most gregarious will
have relatively few friends compared to the size of the network (thousands, perhaps, but
still far smaller than hundreds of millions). The same is true for the hyperlink structure
of the web: each individual web page links to a minuscule portion of all the pages on the
web. In this chapter, we assume processing of sparse graphs, although we will return to
this issue in Section 5.4.
³ Unfortunately, there is no precise definition of sparseness agreed upon by all, but one common definition is
that a sparse graph has O(n) edges, where n is the number of vertices.
The major problem with an adjacency matrix representation for sparse graphs
is its O(n^2) space requirement. Furthermore, most of the cells are zero, by definition.
As a result, most computational implementations of graph algorithms operate over
adjacency lists, in which a node is associated with neighbors that can be reached via
outgoing edges. Figure 5.1 also shows the adjacency list representation of the graph
under consideration (on the right). For example, since n1 is connected by directed
edges to n2 and n4, those two nodes will be on the adjacency list of n1. There are two
options for encoding undirected graphs: one could simply encode each edge twice (if n_i
and n_j are connected, each appears on each other’s adjacency list). Alternatively, one
could order the nodes (arbitrarily or otherwise) and encode edges only on the adjacency
list of the node that comes first in the ordering (i.e., if i < j, then n_j is on the adjacency
list of n_i, but not the other way around).
Note that certain graph operations are easier on adjacency matrices than on ad-
jacency lists. In the first, operations on incoming links for each node translate into a
column scan on the matrix, whereas operations on outgoing links for each node trans-
late into a row scan. With adjacency lists, it is natural to operate on outgoing links, but
computing anything that requires knowledge of the incoming links of a node is difficult.
However, as we shall see, the shuffle and sort mechanism in MapReduce provides an
easy way to group edges by their destination nodes, thus allowing us to compute over
incoming edges in the reducer. This property of the execution framework can also
be used to invert the edges of a directed graph, by mapping over the nodes’ adjacency
lists and emitting key–value pairs with the destination node id as the key and the source
node id as the value.⁴
⁴ This technique is used in anchor text inversion, where one gathers the anchor text of hyperlinks pointing to a
particular page. It is common practice to enrich a web page’s standard textual representation with all of the
anchor text associated with its incoming hyperlinks (e.g., [107]).
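A minimal sketch of such an edge-inverting mapper is given below. It assumes each input line holds
a node id followed by its outgoing adjacency list, separated by a tab (e.g., a line reading 1, a tab,
then 2 4); the input format and class name are ours.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InvertEdgesMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] parts = line.toString().split("\\t");
        int source = Integer.parseInt(parts[0]);
        for (String neighbor : parts[1].trim().split("\\s+")) {
          // Key on the destination so that incoming edges are grouped together by the shuffle.
          context.write(new IntWritable(Integer.parseInt(neighbor)), new IntWritable(source));
        }
      }
    }

After the shuffle and sort, each reducer key is a node id and the associated values are exactly the
sources of its incoming edges, so a trivial reducer can write out the inverted adjacency lists.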
5.2 PARALLEL BREADTH-FIRST SEARCH
One of the most common and well-studied problems in graph theory is the single-source
shortest path problem, where the task is to find shortest paths from a source node to
all other nodes in the graph (or alternatively, edges can be associated with costs or
weights, in which case the task is to compute lowest-cost or lowest-weight paths). Such
problems are a staple in undergraduate algorithm courses, where students are taught the
solution using Dijkstra’s algorithm. However, this famous algorithm assumes sequential
processing—how would we solve this problem in parallel, and more specifically, with
MapReduce?
1: Dijkstra(G, w, s)
2: d[s] ← 0
3: for all vertex v ∈ V do
4: d[v] ← ∞
5: Q ← {V}
6: while Q ≠ ∅ do
7: u ← ExtractMin(Q)
8: for all vertex v ∈ u.AdjacencyList do
9: if d[v] > d[u] + w(u, v) then
10: d[v] ← d[u] + w(u, v)
Figure 5.2: Pseudo-code for Dijkstra’s algorithm, which is based on maintaining a global
priority queue of nodes with priorities equal to their distances from the source node. At each
iteration, the algorithm expands the node with the shortest distance and updates distances to
all reachable nodes.
As a refresher and also to serve as a point of comparison, Dijkstra’s algorithm is
shown in Figure 5.2, adapted from Cormen, Leiserson, and Rivest’s classic algorithms
textbook [41] (often simply known as CLR). The input to the algorithm is a directed,
connected graph G = (V, E) represented with adjacency lists, w containing edge dis-
tances such that w(u, v) ≥ 0, and the source node s. The algorithm begins by first
setting distances to all vertices d[v], v ∈ V to ∞, except for the source node, whose
distance to itself is zero. The algorithm maintains Q, a global priority queue of vertices
with priorities equal to their distance values d.
Dijkstra’s algorithm operates by iteratively selecting the node with the lowest
current distance from the priority queue (initially, this is the source node). At each
iteration, the algorithm “expands” that node by traversing the adjacency list of the
selected node to see if any of those nodes can be reached with a path of a shorter
distance. The algorithm terminates when the priority queue Q is empty, or equivalently,
when all nodes have been considered. Note that the algorithm as presented in Figure 5.2
only computes the shortest distances. The actual paths can be recovered by storing
“backpointers” for every node indicating a fragment of the shortest path.
A sample trace of the algorithm running on a simple graph is shown in Figure 5.3
(example also adapted from CLR). We start out in (a) with n1 having a distance of zero
(since it’s the source) and all other nodes having a distance of ∞. In the first iteration
(a), n1 is selected as the node to expand (indicated by the thicker border). After the
expansion, we see in (b) that n2 and n3 can be reached at a distance of 10 and 5,
respectively. Also, we see in (b) that n3 is the next node selected for expansion. Nodes
we have already considered for expansion are shown in black. Expanding n3, we see in
(c) that the distance to n2 has decreased because we’ve found a shorter path. The nodes
that will be expanded next, in order, are n5, n2, and n4. The algorithm terminates with
the end state shown in (f), where we’ve discovered the shortest distance to all nodes.

Figure 5.3: Example of Dijkstra’s algorithm applied to a simple graph with five nodes, with n1
as the source and edge distances as indicated. Parts (a)–(e) show the running of the algorithm
at each iteration, with the current distance inside the node. Nodes with thicker borders are
those being expanded; nodes that have already been expanded are shown in black.
The key to Dijkstra’s algorithm is the priority queue that maintains a globally-
sorted list of nodes by current distance. This is not possible in MapReduce, as the
programming model does not provide a mechanism for exchanging global data. Instead,
we adopt a brute force approach known as parallel breadth-first search. First, as a
simplification let us assume that all edges have unit distance (modeling, for example,
hyperlinks on the web). This makes the algorithm easier to understand, but we’ll relax
this restriction later.
The intuition behind the algorithm is this: the distance of all nodes connected
directly to the source node is one; the distance of all nodes directly connected to those
is two; and so on. Imagine water rippling away from a rock dropped into a pond—
that’s a good image of how parallel breadth-first search works. However, what if there
are multiple paths to the same node? Suppose we wish to compute the shortest distance
to node n. The shortest path must go through one of the nodes in M that contains an
outgoing edge to n: we need to examine all m ∈ M to find m_s, the node with the shortest
distance. The shortest distance to n is the distance to m_s plus one.
Pseudo-code for the implementation of the parallel breadth-first search algorithm
is provided in Figure 5.4. As with Dijkstra’s algorithm, we assume a connected, directed
graph represented as adjacency lists. Distance to each node is directly stored alongside
the adjacency list of that node, and initialized to ∞ for all nodes except for the source
node. In the pseudo-code, we use n to denote the node id (an integer) and N to denote
the node’s corresponding data structure (adjacency list and current distance). The
algorithm works by mapping over all nodes and emitting a key-value pair for each
neighbor on the node’s adjacency list. The key contains the node id of the neighbor,
and the value is the current distance to the node plus one. This says: if we can reach node
n with a distance d, then we must be able to reach all the nodes that are connected to
n with distance d + 1. After shuffle and sort, reducers will receive keys corresponding to
the destination node ids and distances corresponding to all paths leading to that node.
The reducer will select the shortest of these distances and then update the distance in
the node data structure.
It is apparent that parallel breadth-first search is an iterative algorithm, where
each iteration corresponds to a MapReduce job. The first time we run the algorithm, we
“discover” all nodes that are connected to the source. In the second iteration, we discover
all nodes connected to those, and so on. Each iteration of the algorithm expands the
“search frontier” by one hop, and, eventually, all nodes will be discovered with their
shortest distances (assuming a fully-connected graph). Before we discuss termination
of the algorithm, there is one more detail required to make the parallel breadth-first
search algorithm work. We need to “pass along” the graph structure from one iteration
to the next. This is accomplished by emitting the node data structure itself, with the
node id as a key (Figure 5.4, line 4 in the mapper). In the reducer, we must distinguish
the node data structure from distance values (Figure 5.4, lines 5–6 in the reducer), and
update the minimum distance in the node data structure before emitting it as the final
value. The final output is now ready to serve as input to the next iteration.⁵
⁵ Note that in this algorithm we are overloading the value type, which can either be a distance (integer) or a
complex data structure representing a node. The best way to achieve this in Hadoop is to create a wrapper
object with an indicator variable specifying what the content is.
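A sketch of the kind of wrapper mentioned in the footnote is given below; it assumes a NodeWritable
class (not shown) that holds the adjacency list and current distance, and uses a one-byte indicator
to record which kind of content the value carries.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    public class DistanceOrNode implements Writable {

      public static final byte DISTANCE = 0;
      public static final byte NODE = 1;

      private byte type;
      private int distance;
      private NodeWritable node = new NodeWritable();    // hypothetical node structure

      public void setDistance(int d) { type = DISTANCE; distance = d; }
      public void setNode(NodeWritable n) { type = NODE; node = n; }
      public boolean isNode() { return type == NODE; }
      public int getDistance() { return distance; }
      public NodeWritable getNode() { return node; }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeByte(type);                             // indicator variable comes first
        if (type == DISTANCE) {
          out.writeInt(distance);
        } else {
          node.write(out);
        }
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        type = in.readByte();
        if (type == DISTANCE) {
          distance = in.readInt();
        } else {
          node.readFields(in);
        }
      }
    }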
So how many iterations are necessary to compute the shortest distance to all
nodes? The answer is the diameter of the graph, or the greatest distance between any
pair of nodes. This number is surprisingly small for many real-world problems: the
saying “six degrees of separation” suggests that everyone on the planet is connected to
everyone else by at most six steps (the people a person knows are one step away, people
that they know are two steps away, etc.). If this is indeed true, then parallel breadth-
first search on the global social network would take at most six MapReduce iterations.
1: class Mapper
2: method Map(nid n, node N)
3: d ← N.Distance
4: Emit(nid n, N)                              ▷ Pass along graph structure
5: for all nodeid m ∈ N.AdjacencyList do
6: Emit(nid m, d + 1)                          ▷ Emit distances to reachable nodes

1: class Reducer
2: method Reduce(nid m, [d1, d2, . . .])
3: d_min ← ∞
4: M ← ∅
5: for all d ∈ [d1, d2, . . .] do
6: if IsNode(d) then
7: M ← d                                       ▷ Recover graph structure
8: else if d < d_min then                      ▷ Look for shorter distance
9: d_min ← d
10: M.Distance ← d_min                         ▷ Update shortest distance
11: Emit(nid m, node M)

Figure 5.4: Pseudo-code for parallel breadth-first search in MapReduce: the mappers emit dis-
tances to reachable nodes, while the reducers select the minimum of those distances for each
destination node. Each iteration (one MapReduce job) of the algorithm expands the “search
frontier” by one hop.
For more serious academic studies of “small world” phenomena in networks, we refer
the reader to a number of publications [61, 62, 152, 2]. In practical terms, we iterate
the algorithm until there are no more node distances that are ∞. Since the graph is
connected, all nodes are reachable, and since all edge distances are one, all discovered
nodes are guaranteed to have the shortest distances (i.e., there is not a shorter path
that goes through a node that hasn’t been discovered).
The actual checking of the termination condition must occur outside of Map-
Reduce. Typically, execution of an iterative MapReduce algorithm requires a non-
MapReduce “driver” program, which submits a MapReduce job to iterate the algorithm,
checks to see if a termination condition has been met, and if not, repeats. Hadoop pro-
vides a lightweight API for constructs called “counters”, which, as the name suggests,
can be used for counting events that occur during execution, e.g., number of corrupt
records, number of times a certain condition is met, or anything that the programmer
desires. Counters can be defined to count the number of nodes that have distances of
∞: at the end of the job, the driver program can access the final counter value and
check to see if another iteration is necessary.
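A sketch of such a driver is given below. The configureBfsJob() helper and the counter name are
illustrative, not part of any particular library; the essential points are that the reducer increments
the counter whenever it emits a node whose distance is still ∞, and that the driver reads the counter
after each job to decide whether to launch another iteration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class BfsDriver {

      public enum BfsCounters { UNREACHABLE }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long unreachable;
        int iteration = 0;
        do {
          Job job = configureBfsJob(conf, iteration++);   // one MapReduce job per hop
          if (!job.waitForCompletion(true)) {
            throw new IllegalStateException("Iteration failed");
          }
          // Check the termination condition outside of MapReduce, using the counter value.
          unreachable = job.getCounters().findCounter(BfsCounters.UNREACHABLE).getValue();
        } while (unreachable > 0);
      }

      private static Job configureBfsJob(Configuration conf, int iteration) throws Exception {
        Job job = Job.getInstance(conf, "parallel-bfs-" + iteration);
        // ... set the mapper, reducer, and this iteration's input/output paths here ...
        return job;
      }
    }

In the reducer, the corresponding line would be a call such as
context.getCounter(BfsCounters.UNREACHABLE).increment(1) whenever the minimum distance for
a node is still ∞.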
Figure 5.5: In the single source shortest path problem with arbitrary edge distances, the
shortest path from source s to node r may go outside the current search frontier, in which case
we will not find the shortest distance to r until the search frontier expands to cover q.
Finally, as with Dijkstra’s algorithm in the form presented earlier, the parallel
breadth-first search algorithm only finds the shortest distances, not the actual shortest
paths. However, the path can be straightforwardly recovered. Storing “backpointers”
at each node, as with Dijkstra’s algorithm, will work, but may not be efficient since
the graph needs to be traversed again to reconstruct the path segments. A simpler
approach is to emit paths along with distances in the mapper, so that each node will
have its shortest path easily accessible at all times. The additional space requirements
for shuffling these data from mappers to reducers are relatively modest, since for the
most part paths (i.e., sequence of node ids) are relatively short.
Up until now, we have been assuming that all edges are unit distance. Let us relax
that restriction and see what changes are required in the parallel breadth-first search
algorithm. The adjacency lists, which were previously lists of node ids, must now encode
the edge distances as well. In line 6 of the mapper code in Figure 5.4, instead of emitting
d + 1 as the value, we must now emit d + w where w is the edge distance. No other
changes to the algorithm are required, but the termination behavior is very different.
To illustrate, consider the graph fragment in Figure 5.5, where s is the source node,
and in this iteration, we just “discovered” node r for the very first time. Assume for
the sake of argument that we’ve already discovered the shortest distance to node p, and
that the shortest distance to r so far goes through p. This, however, does not guarantee
that we’ve discovered the shortest distance to r, since there may exist a path going
through q that we haven’t encountered yet (because it lies outside the search frontier).⁶
However, as the search frontier expands, we’ll eventually cover q and all other nodes
along the path from p to q to r—which means that with sufficient iterations, we will
discover the shortest distance to r. But how do we know that we’ve found the shortest
distance to p? Well, if the shortest path to p lies within the search frontier, we would
have already discovered it. And if it doesn’t, the above argument applies. Similarly, we
can repeat the same argument for all nodes on the path from s to p. The conclusion is
that, with sufficient iterations, we’ll eventually discover all the shortest distances.
⁶ Note that the same argument does not apply to the unit edge distance case: the shortest path cannot lie outside
the search frontier since any such path would necessarily be longer.

Figure 5.6: A sample graph that elicits worst-case behavior for parallel breadth-first search.
Eight iterations are required to discover shortest distances to all nodes from n1.

So exactly how many iterations does “eventually” mean? In the worst case, we
might need as many iterations as there are nodes in the graph minus one. In fact, it
is not difficult to construct graphs that will elicit this worst-case behavior: Figure 5.6
provides an example, with n1 as the source. The parallel breadth-first search algorithm
would not discover that the shortest path from n1 to n6 goes through n3, n4, and n5
until the fifth iteration. Three more iterations are necessary to cover the rest of the
graph. Fortunately, for most real-world graphs, such extreme cases are rare, and the
number of iterations necessary to discover all shortest distances is quite close to the
diameter of the graph, as in the unit edge distance case.
In practical terms, how do we know when to stop iterating in the case of arbitrary
edge distances? The algorithm can terminate when shortest distances at every node no
longer change. Once again, we can use counters to keep track of such events. Every time
we encounter a shorter distance in the reducer, we increment a counter. At the end of
each MapReduce iteration, the driver program reads the counter value and determines
if another iteration is necessary.
Compared to Dijkstra’s algorithm on a single processor, parallel breadth-first
search in MapReduce can be characterized as a brute force approach that “wastes” a
lot of time performing computations whose results are discarded. At each iteration, the
algorithm attempts to recompute distances to all nodes, but in reality only useful work
is done along the search frontier: inside the search frontier, the algorithm is simply
repeating previous computations.⁷ Outside the search frontier, the algorithm hasn’t
discovered any paths to nodes there yet, so no meaningful work is done. Dijkstra’s
algorithm, on the other hand, is far more efficient. Every time a node is explored, we’re
guaranteed to have already found the shortest path to it. However, this is made possible
by maintaining a global data structure (a priority queue) that holds nodes sorted by
distance—this is not possible in MapReduce because the programming model does not
provide support for global data that is mutable and accessible by the mappers and
reducers. These inefficiencies represent the cost of parallelization.
⁷ Unless the algorithm discovers an instance of the situation described in Figure 5.5, in which case, updated
distances will propagate inside the search frontier.
The parallel breadth-first search algorithm is instructive in that it represents the
prototypical structure of a large class of graph algorithms in MapReduce. They share
the following characteristics:
• The graph structure is represented with adjacency lists, which is part of some larger
node data structure that may contain additional information (variables to store
intermediate output, features of the nodes). In many cases, features are attached
to edges as well (e.g., edge weights).
• The MapReduce algorithm maps over the node data structures and performs a
computation that is a function of features of the node, intermediate output at-
tached to each node, and features of the adjacency list (outgoing edges and their
features). In other words, computations can only involve a node’s internal state
and its local graph structure. The results of these computations are emitted as val-
ues, keyed with the node ids of the neighbors (i.e., those nodes on the adjacency
lists). Conceptually, we can think of this as “passing” the results of the computa-
tion along outgoing edges. In the reducer, the algorithm receives all partial results
that have the same destination node, and performs another computation (usually,
some form of aggregation).
• In addition to computations, the graph itself is also passed from the mapper to the
reducer. In the reducer, the data structure corresponding to each node is updated
and written back to disk.
• Graph algorithms in MapReduce are generally iterative, where the output of the
previous iteration serves as input to the next iteration. The process is controlled
by a non-MapReduce driver program that checks for termination.
For parallel breadth-first search, the mapper computation is the current distance plus
edge distance (emitting distances to neighbors), while the reducer computation is the
Min function (selecting the shortest path). As we will see in the next section, the
MapReduce algorithm for PageRank works in much the same way.
5.3 PAGERANK
PageRank [117] is a measure of web page quality based on the structure of the hyperlink
graph. Although it is only one of thousands of features that are taken into account in
Google’s search algorithm, it is perhaps one of the best known and most studied.
A vivid way to illustrate PageRank is to imagine a random web surfer: the surfer
visits a page, randomly clicks a link on that page, and repeats ad infinitum. Page-
Rank is a measure of how frequently a page would be encountered by our tireless web
surfer. More precisely, PageRank is a probability distribution over nodes in the graph
representing the likelihood that a random walk over the link structure will arrive at a
particular node. Nodes that have high in-degrees tend to have high PageRank values,
as well as nodes that are linked to by other nodes with high PageRank values. This
behavior makes intuitive sense: if PageRank is a measure of page quality, we would ex-
pect high-quality pages to contain “endorsements” from many other pages in the form
of hyperlinks. Similarly, if a high-quality page links to another page, then the second
page is likely to be high quality also. PageRank represents one particular approach to
inferring the quality of a web page based on hyperlink structure; two other popular
algorithms, not covered here, are SALSA [88] and HITS [84] (also known as “hubs and
authorities”).
The complete formulation of PageRank includes an additional component. As it
turns out, our web surfer doesn’t just randomly click links. Before the surfer decides
where to go next, a biased coin is flipped—heads, the surfer clicks on a random link on
the page as usual. Tails, however, the surfer ignores the links on the page and randomly
“jumps” or “teleports” to a completely different page.
But enough about random web surfing. Formally, the PageRank P of a page n is
defined as follows:

P(n) = α (1/|G|) + (1 − α) Σ_{m ∈ L(n)} P(m)/C(m)    (5.1)

where |G| is the total number of nodes (pages) in the graph, α is the random jump
factor, L(n) is the set of pages that link to n, and C(m) is the out-degree of node m
(the number of links on page m). The random jump factor α is sometimes called the
“teleportation” factor; alternatively, (1 − α) is referred to as the “damping” factor.
Let us break down each component of the formula in detail. First, note that
PageRank is defined recursively—this gives rise to an iterative algorithm we will detail
in a bit. A web page n receives PageRank “contributions” from all pages that link to
it, L(n). Let us consider a page m from the set of pages L(n): a random surfer at
m will arrive at n with probability 1/C(m) since a link is selected at random from all
outgoing links. Since the PageRank value of m is the probability that the random surfer
will be at m, the probability of arriving at n from m is P(m)/C(m). To compute the
PageRank of n, we need to sum contributions from all pages that link to n. This is
the summation in the second half of the equation. However, we also need to take into
account the random jump: there is a 1/|G| chance of landing at any particular page,
where |G| is the number of nodes in the graph. Of course, the two contributions need to
be combined: with probability α the random surfer executes a random jump, and with
probability 1 −α the random surfer follows a hyperlink.
Note that PageRank assumes a community of honest users who are not trying to
“game” the measure. This is, of course, not true in the real world, where an adversarial
relationship exists between search engine companies and a host of other organizations
and individuals (marketers, spammers, activists, etc.) who are trying to manipulate
search results—to promote a cause, product, or service, or in some cases, to trap and
intentionally deceive users (see, for example, [12, 63]). A simple example is a so-called
“spider trap”, an infinite chain of pages (e.g., generated by CGI) that all link to a single
page (thereby artificially inflating its PageRank). For this reason, PageRank is only one
of thousands of features used in ranking web pages.
The fact that PageRank is recursively defined translates into an iterative algo-
rithm which is quite similar in basic structure to parallel breadth-first search. We start
by presenting an informal sketch. At the beginning of each iteration, a node passes its
PageRank contributions to other nodes that it is connected to. Since PageRank is a
probability distribution, we can think of this as spreading probability mass to neigh-
bors via outgoing links. To conclude the iteration, each node sums up all PageRank
contributions that have been passed to it and computes an updated PageRank score.
We can think of this as gathering probability mass passed to a node via its incoming
links. This algorithm iterates until PageRank values don’t change anymore.
Figure 5.7 shows a toy example that illustrates two iterations of the algorithm.
As a simplification, we ignore the random jump factor for now (i.e., α = 0) and further
assume that there are no dangling nodes (i.e., nodes with no outgoing edges). The
algorithm begins by initializing a uniform distribution of PageRank values across nodes.
In the beginning of the first iteration (top, left), partial PageRank contributions are sent
from each node to its neighbors connected via outgoing links. For example, n_1 sends
0.1 PageRank mass to n_2 and 0.1 PageRank mass to n_4. This makes sense in terms of
the random surfer model: if the surfer is at n_1 with a probability of 0.2, then the surfer
could end up either in n_2 or n_4 with a probability of 0.1 each. The same occurs for all
the other nodes in the graph: note that n_5 must split its PageRank mass three ways,
since it has three neighbors, and n_4 receives all the mass belonging to n_3 because n_3
isn’t connected to any other node. The end of the first iteration is shown in the top
right: each node sums up PageRank contributions from its neighbors. Note that since
n_1 has only one incoming link, from n_3, its updated PageRank value is smaller than
before, i.e., it “passed along” more PageRank mass than it received.
[Figure 5.7 graphic: in iteration 1, the PageRank values (n_1, n_2, n_3, n_4, n_5) go from (0.2, 0.2, 0.2, 0.2, 0.2) to (0.066, 0.166, 0.166, 0.3, 0.3); in iteration 2, they go to (0.1, 0.133, 0.183, 0.2, 0.383).]
Figure 5.7: PageRank toy example showing two iterations, top and bottom. Left graphs show
PageRank values at the beginning of each iteration and how much PageRank mass is passed to
each neighbor. Right graphs show updated PageRank values at the end of each iteration.
The exact same process repeats, and the second iteration in our toy example is illustrated by the bottom
two graphs. At the beginning of each iteration, the PageRank values of all nodes sum
to one. PageRank mass is preserved by the algorithm, guaranteeing that we continue
to have a valid probability distribution at the end of each iteration.
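As a quick check on this toy example, the simplified iteration (still ignoring the random jump factor and dangling nodes) can be sketched in a few lines of Python. The adjacency lists below are those of the toy graph (they also appear in Figure 5.9); the function and variable names are our own and not part of any MapReduce API.

from collections import defaultdict

def pagerank_iteration(graph, ranks):
    # One simplified PageRank iteration: spread mass along outgoing links,
    # then gather it at each destination node.
    new_ranks = defaultdict(float)
    for node, neighbors in graph.items():
        share = ranks[node] / len(neighbors)
        for m in neighbors:
            new_ranks[m] += share
    return dict(new_ranks)

graph = {'n1': ['n2', 'n4'], 'n2': ['n3', 'n5'], 'n3': ['n4'],
         'n4': ['n5'], 'n5': ['n1', 'n2', 'n3']}
ranks = {n: 1.0 / len(graph) for n in graph}   # uniform initialization

for _ in range(2):
    ranks = pagerank_iteration(graph, ranks)
# After two iterations: n1 ≈ 0.1, n2 ≈ 0.133, n3 ≈ 0.183, n4 ≈ 0.2, n5 ≈ 0.383,
# matching the values in Figure 5.7.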
Pseudo-code of the MapReduce PageRank algorithm is shown in Figure 5.8; it is
simplified in that we continue to ignore the random jump factor and assume no dangling
nodes (complications that we will return to later). An illustration of the running algo-
rithm is shown in Figure 5.9 for the first iteration of the toy graph in Figure 5.7. The
algorithm maps over the nodes, and for each node computes how much PageRank mass
needs to be distributed to its neighbors (i.e., nodes on the adjacency list). Each piece
of the PageRank mass is emitted as the value, keyed by the node ids of the neighbors.
Conceptually, we can think of this as passing PageRank mass along outgoing edges.
In the shuffle and sort phase, the MapReduce execution framework groups values
(pieces of PageRank mass) passed along the graph edges by destination node (i.e., all
edges that point to the same node). In the reducer, PageRank mass contributions from
all incoming edges are summed to arrive at the updated PageRank value for each node.
class Mapper
    method Map(nid n, node N)
        p ← N.PageRank / |N.AdjacencyList|
        Emit(nid n, N)                            ▷ Pass along graph structure
        for all nodeid m ∈ N.AdjacencyList do
            Emit(nid m, p)                        ▷ Pass PageRank mass to neighbors

class Reducer
    method Reduce(nid m, [p_1, p_2, . . .])
        M ← ∅
        s ← 0
        for all p ∈ [p_1, p_2, . . .] do
            if IsNode(p) then
                M ← p                             ▷ Recover graph structure
            else
                s ← s + p                         ▷ Sum incoming PageRank contributions
        M.PageRank ← s
        Emit(nid m, node M)
Figure 5.8: Pseudo-code for PageRank in MapReduce (leaving aside dangling nodes and the
random jump factor). In the map phase we evenly divide up each node’s PageRank mass and
pass each piece along outgoing edges to neighbors. In the reduce phase PageRank contributions
are summed up at each destination node. Each MapReduce job corresponds to one iteration of
the algorithm.
[Figure 5.9 graphic: the toy graph's adjacency lists are n_1 [n_2, n_4], n_2 [n_3, n_5], n_3 [n_4], n_4 [n_5], and n_5 [n_1, n_2, n_3]; the map stage emits PageRank mass keyed by destination node, and the reduce stage sums the contributions arriving at each node.]
Figure 5.9: Illustration of the MapReduce PageRank algorithm corresponding to the first
iteration in Figure 5.7. The size of each box is proportional to its PageRank value. During the
map phase, PageRank mass is distributed evenly to nodes on each node’s adjacency list (shown
at the very top). Intermediate values are keyed by node (shown inside the boxes). In the reduce
phase, all partial PageRank contributions are summed together to arrive at updated values.
As with the parallel breadth-first search algorithm, the graph structure itself must be
passed from iteration to iteration. Each node data structure is emitted in the mapper
and written back out to disk in the reducer. All PageRank mass emitted by the mappers
is accounted for in the reducer: since we begin with the sum of PageRank values across
all nodes equal to one, the sum of all the updated PageRank values should remain a
valid probability distribution.
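For readers who prefer runnable code, here is one possible Python rendering of the mapper and reducer in Figure 5.8. It is only a sketch under our own conventions: the Node class, the emit callback, and the use of isinstance to distinguish graph structure from PageRank mass are not part of Hadoop's API.

class Node:
    def __init__(self, node_id, adjacency_list, pagerank):
        self.node_id = node_id
        self.adjacency_list = adjacency_list
        self.pagerank = pagerank

def pagerank_map(nid, node, emit):
    # Divide this node's PageRank mass evenly among its neighbors.
    p = node.pagerank / len(node.adjacency_list)
    emit(nid, node)                      # pass along the graph structure
    for m in node.adjacency_list:
        emit(m, p)                       # pass PageRank mass to neighbors

def pagerank_reduce(nid, values, emit):
    # Recover the node structure and sum incoming PageRank contributions.
    node, s = None, 0.0
    for value in values:
        if isinstance(value, Node):
            node = value                 # recover graph structure
        else:
            s += value                   # sum incoming PageRank mass
    node.pagerank = s
    emit(nid, node)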
Having discussed the simplified PageRank algorithm in MapReduce, let us now
take into account the random jump factor and dangling nodes: as it turns out both are
treated similarly. Dangling nodes are nodes in the graph that have no outgoing edges,
i.e., their adjacency lists are empty. In the hyperlink graph of the web, these might
correspond to pages in a crawl that have not been downloaded yet. If we simply run
the algorithm in Figure 5.8 on graphs with dangling nodes, the total PageRank mass
will not be conserved, since no key-value pairs will be emitted when a dangling node is
encountered in the mappers.
The proper treatment of PageRank mass “lost” at the dangling nodes is to re-
distribute it across all nodes in the graph evenly (cf. [22]). There are many ways to
determine the missing PageRank mass. One simple approach is by instrumenting the
algorithm in Figure 5.8 with counters: whenever the mapper processes a node with an
empty adjacency list, it keeps track of the node’s PageRank value in the counter. At
the end of the iteration, we can access the counter to find out how much PageRank
mass was lost at the dangling nodes.⁸ Another approach is to reserve a special key for
storing PageRank mass from dangling nodes. When the mapper encounters a dangling
node, its PageRank mass is emitted with the special key; the reducer must be modified
to contain special logic for handling the missing PageRank mass. Yet another approach
is to write out the missing PageRank mass as “side data” for each map task (using
the in-mapper combining technique for aggregation); a final pass in the driver program
is needed to sum the mass across all map tasks. Either way, we arrive at the amount
of PageRank mass lost at the dangling nodes—this then must be redistributed evenly
across all nodes.
This redistribution process can be accomplished by mapping over all nodes again.
At the same time, we can take into account the random jump factor. For each node, its
current PageRank value p is updated to the final PageRank value p′ according to the
following formula:

p′ = α (1/|G|) + (1 − α) (m/|G| + p)    (5.2)
where m is the missing PageRank mass, and |G| is the number of nodes in the entire
graph. We add the PageRank mass from link traversal (p, computed from before) to
the share of the lost PageRank mass that is distributed to each node (m/|G|). Finally,
we take into account the random jump factor: with probability α the random surfer
arrives via jumping, and with probability 1 −α the random surfer arrives via incoming
links. Note that this MapReduce job requires no reducers.
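A minimal Python sketch of this second, map-only pass appears below. It simply applies Equation 5.2 to each node; the missing mass m and the graph size |G| are assumed to have been obtained beforehand (e.g., from counters) and passed in, and the value of α shown is just a common choice, not one prescribed by the text.

ALPHA = 0.15    # random jump factor (a typical value; assumed here)

def second_pass_map(nid, node, missing_mass, num_nodes, emit):
    # Apply Equation 5.2: combine the random jump with the link-traversal mass
    # plus this node's even share of the mass lost at dangling nodes.
    p = node.pagerank
    node.pagerank = ALPHA * (1.0 / num_nodes) + \
                    (1.0 - ALPHA) * (missing_mass / num_nodes + p)
    emit(nid, node)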
Putting everything together, one iteration of PageRank requires two MapReduce
jobs: the first to distribute PageRank mass along graph edges, and the second to take
care of dangling nodes and the random jump factor. At the end of each iteration, we end
up with exactly the same data structure as the beginning, which is a requirement for
the iterative algorithm to work. Also, the PageRank values of all nodes sum up to one,
which ensures a valid probability distribution.
Typically, PageRank is iterated until convergence, i.e., when the PageRank values
of nodes no longer change (within some tolerance, to take into account, for example,
floating point precision errors). Therefore, at the end of each iteration, the PageRank
driver program must check to see if convergence has been reached. Alternative stopping
criteria include running a fixed number of iterations (useful if one wishes to bound
algorithm running time) or stopping when the ranks of PageRank values no longer
change. The latter is useful for some applications that only care about comparing the
PageRank of two arbitrary pages and do not need the actual PageRank values. Rank
stability is obtained faster than the actual convergence of values.
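A driver loop implementing the convergence check (with an iteration cap as a safeguard) might look roughly like the following Python sketch; run_pagerank_iteration and read_ranks are hypothetical helpers standing in for launching the two MapReduce jobs and reading back their output, and the tolerance and cap are arbitrary choices.

def run_until_converged(run_pagerank_iteration, read_ranks,
                        tolerance=1e-6, max_iterations=50):
    previous = read_ranks()
    current = previous
    for i in range(max_iterations):
        run_pagerank_iteration(i)        # the two MapReduce jobs for iteration i
        current = read_ranks()
        delta = max(abs(current[n] - previous[n]) for n in current)
        if delta < tolerance:            # PageRank values no longer change
            break
        previous = current
    return current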
⁸ In Hadoop, counters are 8-byte integers: a simple workaround is to multiply PageRank values by a large constant, and then cast as an integer.
In absolute terms, how many iterations are necessary for PageRank to converge?
This is a difficult question to precisely answer since it depends on many factors, but
generally, fewer than one might expect. In the original PageRank paper [117], conver-
gence on a graph with 322 million edges was reached in 52 iterations (see also Bianchini
et al. [22] for additional discussion). On today’s web, the answer is not very meaningful
due to the adversarial nature of web search as previously discussed—the web is full
of spam and populated with sites that are actively trying to “game” PageRank and
related hyperlink-based metrics. As a result, running PageRank in its unmodified form
presented here would yield unexpected and undesirable results. Of course, strategies
developed by web search companies to combat link spam are proprietary (and closely-
guarded secrets, for obvious reasons)—but undoubtedly these algorithmic modifications
impact convergence behavior. A full discussion of the escalating “arms race” between
search engine companies and those that seek to promote their sites is beyond the scope
of this book.⁹
5.4 ISSUES WITH GRAPH PROCESSING
The biggest difference between MapReduce graph algorithms and single-machine graph
algorithms is that with the latter, it is usually possible to maintain global data structures
in memory for fast, random access. For example, Dijkstra’s algorithm uses a global
priority queue that guides the expansion of nodes. This, of course, is not possible with
MapReduce—the programming model does not provide any built-in mechanism for
communicating global state. Since the most natural representation of large sparse graphs
is with adjacency lists, communication can only occur from a node to the nodes it links
to, or to a node from nodes linked to it—in other words, passing information is only
possible within the local graph structure.¹⁰
This restriction gives rise to the structure of many graph algorithms in Map-
Reduce: local computation is performed on each node, the results of which are “passed”
to its neighbors. With multiple iterations, convergence on the global graph is possible.
The passing of partial results along a graph edge is accomplished by the shuffling and
sorting provided by the MapReduce execution framework. The amount of intermediate
data generated is on the order of the number of edges, which explains why all the algo-
rithms we have discussed assume sparse graphs. For dense graphs, MapReduce running
time would be dominated by copying intermediate data across the network, which in the
worst case is O(n²) in the number of nodes in the graph. Since MapReduce clusters are
⁹ For the interested reader, the proceedings of a workshop series on Adversarial Information Retrieval (AIRWeb) provide great starting points into the literature.
¹⁰ Of course, it is perfectly reasonable to compute derived graph structures in a pre-processing step. For example, if one wishes to propagate information from a node to all nodes that are within two links, one could process graph G to derive graph G′, where there would exist a link from node n_i to n_j if n_j was reachable within two link traversals of n_i in the original graph G.
designed around commodity networks (e.g., gigabit Ethernet), MapReduce algorithms
are often impractical on large, dense graphs.
Combiners and the in-mapper combining pattern described in Section 3.1 can
be used to decrease the running time of graph iterations. It is straightforward to use
combiners for both parallel breadth-first search and PageRank since Min and sum,
used in the two algorithms, respectively, are both associative and commutative. How-
ever, combiners are only effective to the extent that there are opportunities for partial
aggregation—unless there are nodes pointed to by multiple nodes being processed by
an individual map task, combiners are not very useful. This implies that it would be
desirable to partition large graphs into smaller components where there are many intra-
component links and fewer inter-component links. This way, we can arrange the data
such that nodes in the same component are handled by the same map task—thus max-
imizing opportunities for combiners to perform local aggregation.
Unfortunately, this sometimes creates a chicken-and-egg problem. It would be desir-
able to partition a large graph to facilitate efficient processing by MapReduce. But the
graph may be so large that we can’t partition it except with MapReduce algorithms!
Fortunately, in many cases there are simple solutions around this problem in the form
of “cheap” partitioning heuristics based on reordering the data [106]. For example, in
a social network, we might sort nodes representing users by zip code, as opposed to
by last name—based on the observation that friends tend to live close to each other.
Sorting by an even more cohesive property such as school would be even better (if
available): the probability of any two random students from the same school knowing
each other is much higher than two random students from different schools. Another
good example is to partition the web graph by the language of the page (since pages in
one language tend to link mostly to other pages in that language) or by domain name
(since inter-domain links are typically much denser than intra-domain links). Resorting
records using MapReduce is both easy to do and a relatively cheap operation—however,
whether the efficiencies gained by this crude form of partitioning are worth the extra
time taken in performing the resort is an empirical question that will depend on the
actual graph structure and algorithm.
Finally, there is a practical consideration to keep in mind when implementing
graph algorithms that estimate probability distributions over nodes (such as Page-
Rank). For large graphs, the probability of any particular node is often so small that
it underflows standard floating point representations. A very common solution to this
problem is to represent probabilities using their logarithms. When probabilities are
stored as logs, the product of two values is simply their sum. However, addition of
probabilities is also necessary, for example, when summing PageRank contributions for
a node. This can be implemented with reasonable precision as follows:
a ⊕ b = b + log(1 + e^{a−b})    if a < b
a ⊕ b = a + log(1 + e^{b−a})    if a ≥ b
Furthermore, many math libraries include a log1p function which computes log(1 + x)
with higher precision than the naïve implementation would have when x is very small
(as is often the case when working with probabilities). Its use may further improve the
accuracy of implementations that use log probabilities.
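A minimal Python sketch of this log-space addition, using math.log1p for the log(1 + x) term:

import math

def log_add(a, b):
    # Compute log(e^a + e^b) from log-probabilities a and b without underflow.
    if a < b:
        a, b = b, a                  # ensure a >= b so the exponent is non-positive
    return a + math.log1p(math.exp(b - a))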
5.5 SUMMARY AND ADDITIONAL READINGS
This chapter covers graph algorithms in MapReduce, discussing in detail parallel
breadth-first search and PageRank. Both are instances of a large class of iterative algo-
rithms that share the following characteristics:
• The graph structure is represented with adjacency lists.
• Algorithms map over nodes and pass partial results to nodes on their adjacency
lists. Partial results are aggregated for each node in the reducer.
• The graph structure itself is passed from the mapper to the reducer, such that the
output is in the same form as the input.
• Algorithms are iterative and under the control of a non-MapReduce driver pro-
gram, which checks for termination at the end of each iteration.
The MapReduce programming model does not provide a mechanism to maintain global
data structures accessible and mutable by all the mappers and reducers.¹¹ One impli-
cation of this is that communication between pairs of arbitrary nodes is difficult to
accomplish. Instead, information typically propagates along graph edges—which gives
rise to the structure of algorithms discussed above.
Additional Readings. The ubiquity of large graphs translates into substantial inter-
est in scalable graph algorithms using MapReduce in industry, academia, and beyond.
There is, of course, much beyond what has been covered in this chapter. For additional
material, we refer readers to the following: Kang et al. [80] presented an approach to
estimating the diameter of large graphs using MapReduce and a library for graph min-
ing [81]; Cohen [39] discussed a number of algorithms for processing undirected graphs,
with social network analysis in mind; Rao and Yarowsky [128] described an implemen-
tation of label propagation, a standard algorithm for semi-supervised machine learning,
on graphs derived from textual data; Schatz [132] tackled the problem of DNA sequence
¹¹ However, maintaining globally-synchronized state may be possible with the assistance of other tools (e.g., a distributed database).
alignment and assembly with graph algorithms in MapReduce. Finally, it is easy to for-
get that parallel graph algorithms have been studied by computer scientists for several
decades, particularly in the PRAM model [77, 60]. It is not clear, however, to what extent
well-known PRAM algorithms translate naturally into the MapReduce framework.
C H A P T E R 6
EM Algorithms for Text Processing
Until the end of the 1980s, text processing systems tended to rely on large numbers
of manually written rules to analyze, annotate, and transform text input, usually in
a deterministic way. This rule-based approach can be appealing: a system’s behavior
can generally be understood and predicted precisely, and, when errors surface, they can
be corrected by writing new rules or refining old ones. However, rule-based systems
suffer from a number of serious problems. They are brittle with respect to the natural
variation found in language, and developing systems that can deal with inputs from
diverse domains is very labor intensive. Furthermore, when these systems fail, they
often do so catastrophically, unable to offer even a “best guess” as to what the desired
analysis of the input might be.
In the last 20 years, the rule-based approach has largely been abandoned in favor
of more data-driven methods, where the “rules” for processing the input are inferred
automatically from large corpora of examples, called training data. The basic strategy of
the data-driven approach is to start with a processing algorithm capable of capturing
how any instance of the kinds of inputs (e.g., sentences or emails) can relate to any
instance of the kinds of outputs that the final system should produce (e.g., the syntactic
structure of the sentence or a classification of the email as spam). At this stage, the
system can be thought of as having the potential to produce any output for any input,
but they are not distinguished in any way. Next, a learning algorithm is applied which
refines this process based on the training data—generally attempting to make the model
perform as well as possible at predicting the examples in the training data. The learning
process, which often involves iterative algorithms, typically consists of activities like
ranking rules, instantiating the content of rule templates, or determining parameter
settings for a given model. This is known as machine learning, an active area of research.
Data-driven approaches have turned out to have several benefits over rule-based
approaches to system development. Since data-driven systems can be trained using
examples of the kind that they will eventually be used to process, they tend to deal
with the complexities found in real data more robustly than rule-based systems do.
Second, developing training data tends to be far less expensive than developing rules. For
some applications, significant quantities of training data may even exist for independent
reasons (e.g., translations of text into multiple languages are created by authors wishing
to reach an audience speaking different languages, not because they are generating
training data for a data-driven machine translation system). These advantages come
at the cost of systems that often behave internally quite differently than a human-
engineered system. As a result, correcting errors that the trained system makes can be
quite challenging.
Data-driven information processing systems can be constructed using a variety
of mathematical techniques, but in this chapter we focus on statistical models, which
probabilistically relate inputs from an input set X (e.g., sentences, documents, etc.),
which are always observable, to annotations from a set Y, which is the space of possible
annotations or analyses that the system should predict. This model may take the form
of either a joint model Pr(x, y) which assigns a probability to every pair ⟨x, y⟩ ∈ X × Y
or a conditional model Pr(y|x), which assigns a probability to every y ∈ Y, given
an x ∈ X. For example, to create a statistical spam detection system, we might have
Y = {Spam, NotSpam} and X be the set of all possible email messages. For machine
translation, X might be the set of Arabic sentences and Y the set of English sentences.¹
There are three closely related, but distinct challenges in statistical text-
processing. The first is model selection. This entails selecting a representation of a
joint or conditional distribution over the desired X and Y. For a problem where X and
Y are very small, one could imagine representing these probabilities in look-up tables.
However, for something like email classification or machine translation, where the model
space is infinite, the probabilities cannot be represented directly, and must be computed
algorithmically. As an example of such models, we introduce hidden Markov models
(HMMs), which define a joint distribution over sequences of inputs and sequences of
annotations. The second challenge is parameter estimation or learning, which involves
the application of an optimization algorithm and training criterion to select the param-
eters of the model to optimize the model’s performance (with respect to the given
training criterion) on the training data.² The parameters of a statistical model are
the values used to compute the probability of some event described by the model. In
this chapter we will focus on one particularly simple training criterion for parameter
estimation, maximum likelihood estimation, which says to select the parameters that
make the training data most probable under the model, and one learning algorithm
that attempts to meet this criterion, called expectation maximization (EM). The final
challenge for statistical modeling is the problem of decoding, or, given some x, using the
model to select an annotation y. One very common strategy is to select y according to
the following criterion:

y* = arg max_{y ∈ Y} Pr(y|x)
¹ In this chapter, we will consider discrete models only. They tend to be sufficient for text processing, and their presentation is simpler than models with continuous densities. It should be kept in mind that the sets X and Y may still be countably infinite.
² We restrict our discussion in this chapter to models with finite numbers of parameters and where the learning process refers to setting those parameters. Inference in and learning of so-called nonparametric models, which have an infinite number of parameters and have become important statistical models for text processing in recent years, is beyond the scope of this chapter.
In a conditional (or direct) model, this is a straightforward search for the best y un-
der the model. In a joint model, the search is also straightforward, on account of the
definition of conditional probability:
y* = arg max_{y ∈ Y} Pr(y|x) = arg max_{y ∈ Y} Pr(x, y) / ∑_{y′} Pr(x, y′) = arg max_{y ∈ Y} Pr(x, y)
The specific form that the search takes will depend on how the model is represented.
Our focus in this chapter will primarily be on the second problem: learning parameters
for models, but we will touch on the third problem as well.
Machine learning is often categorized as either supervised or unsupervised. Su-
pervised learning of statistical models simply means that the model parameters are
estimated from training data consisting of pairs of inputs and annotations, that is
Z = ⟨⟨x_1, y_1⟩, ⟨x_2, y_2⟩, . . .⟩ where ⟨x_i, y_i⟩ ∈ X × Y and y_i is the gold standard (i.e., cor-
rect) annotation of x_i. While supervised models often attain quite good performance,
they are often uneconomical to use, since the training data requires each object that
is to be classified (to pick a specific task), x_i to be paired with its correct label, y_i.
In many cases, these gold standard training labels must be generated by a process of
expert annotation, meaning that each x_i must be manually labeled by a trained indi-
vidual. Even when the annotation task is quite simple for people to carry out (e.g., in
the case of spam detection), the number of potential examples that could be classified
(representing a subset of X, which may of course be infinite in size) will far exceed
the amount of data that can be annotated. As the annotation task becomes more com-
plicated (e.g., when predicting more complex structures such as sequences of labels or
when the annotation task requires specialized expertise), annotation becomes far more
challenging.
Unsupervised learning, on the other hand, requires only that the training data
consist of a representative collection of objects that should be annotated, that is
Z = ⟨x_1, x_2, . . .⟩ where x_i ∈ X, but without any example annotations. While it may
at first seem counterintuitive that meaningful annotations can be learned without any
examples of the desired annotations being given, the learning criteria and model struc-
ture (which crucially define the space of possible annotations Y and the process by
which annotations relate to observable inputs) make it possible to induce annotations
by relying on regularities in the unclassified training instances. While a thorough discus-
sion of unsupervised learning is beyond the scope of this book, we focus on a particular
class of algorithms—expectation maximization (EM) algorithms—that can be used to
learn the parameters of a joint model Pr(x, y) from incomplete data (i.e., data where
some of the variables in the model cannot be observed; in the case of unsupervised
learning, the y_i's are unobserved). Expectation maximization algorithms fit naturally
into the MapReduce paradigm, and are used to solve a number of problems of interest
in text processing. Furthermore, these algorithms can be quite computationally expen-
sive, since they generally require repeated evaluations of the training data. MapReduce
therefore provides an opportunity not only to scale to larger amounts of data, but also
to improve efficiency bottlenecks at scales where non-parallel solutions could be utilized.
This chapter is organized as follows. In Section 6.1, we describe maximum likeli-
hood estimation for statistical models, show how this is generalized to models where not
all variables are observable, and then introduce expectation maximization (EM). We
describe hidden Markov models (HMMs) in Section 6.2, a very versatile class of models
that uses EM for parameter estimation. Section 6.3 discusses how EM algorithms can be
expressed in MapReduce, and then in Section 6.4 we look at a case study of word align-
ment for statistical machine translation. Section 6.5 examines similar algorithms that
are appropriate for supervised learning tasks. This chapter concludes with a summary
and pointers to additional readings.
6.1 EXPECTATION MAXIMIZATION
Expectation maximization (EM) algorithms [49] are a family of iterative optimization
algorithms for learning probability distributions from incomplete data. They are ex-
tensively used in statistical natural language processing where one seeks to infer latent
linguistic structure from unannotated text. To name just a few applications, EM algo-
rithms have been used to find part-of-speech sequences, constituency and dependency
trees, alignments between texts in different languages, alignments between acoustic sig-
nals and their transcriptions, as well as for numerous other clustering and structure
discovery problems.
Expectation maximization generalizes the principle of maximum likelihood esti-
mation to the case where the values of some variables are unobserved (specifically, those
characterizing the latent structure that is sought).
6.1.1 MAXIMUM LIKELIHOOD ESTIMATION
Maximum likelihood estimation (MLE) is a criterion for fitting the parameters θ of
a statistical model to some given data x. Specifically, it says to select the parameter
settings θ* such that the likelihood of observing the training data given the model is
maximized:

θ* = arg max_θ Pr(X = x; θ)    (6.1)
To illustrate, consider the simple marble game shown in Figure 6.1. In this game,
a marble is released at the position indicated by the black dot, and it bounces down
into one of the cups at the bottom of the board, being diverted to the left or right by
the peg (indicated by a triangle) in the center. Our task is to construct a model that
predicts which cup the ball will drop into. A “rule-based” approach might be to take
Figure 6.1: A simple marble game where a released marble takes one of two possible paths.
This game can be modeled using a Bernoulli random variable with parameter p, which indicates
the probability that the marble will go to the right when it hits the peg.
exact measurements of the board and construct a physical model that we can use to
predict the behavior of the ball. Given sophisticated enough measurements, this could
certainly lead to a very accurate model. However, the construction of this model would
be quite time consuming and difficult.
A statistical approach, on the other hand, might be to assume that the behavior
of the marble in this game can be modeled using a Bernoulli random variable Y with
parameter p. That is, the value of the random variable indicates whether path 0 or 1 is
taken. We also define a random variable X whose value is the label of the cup that the
marble ends up in; note that X is deterministically related to Y , so an observation of
X is equivalent to an observation of Y .
To estimate the parameter p of the statistical model of our game, we need
some training data, so we drop 10 marbles into the game which end up in cups
x = ⟨b, b, b, a, b, b, b, b, b, a⟩.
What is the maximum likelihood estimate of p given this data? By assuming
that our samples are independent and identically distributed (i.i.d.), we can write the
likelihood of our data as follows:³

Pr(x; p) = ∏_{j=1}^{10} p^{δ(x_j,a)} (1 − p)^{δ(x_j,b)} = p^2 (1 − p)^8
Since log is a monotonically increasing function, maximizing log Pr(x; p) will give us
the desired result. We can do this by differentiating with respect to p and finding where
³ In this equation, δ is the Kronecker delta function which evaluates to 1 where its arguments are equal and 0 otherwise.
the resulting expression equals 0:
d log Pr(x; p) / dp = 0
d[2 log p + 8 log(1 − p)] / dp = 0
2/p − 8/(1 − p) = 0
Solving for p yields 0.2, which is the intuitive result. Furthermore, it is straightforward
to show that in N trials where N_0 marbles followed path 0 to cup a, and N_1 marbles
followed path 1 to cup b, the maximum likelihood estimate of p is N_1/(N_0 + N_1).
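As a quick numerical check of this derivation, the estimate can be computed directly from the counts in the sample; in this small Python sketch (variable names are ours), p is the parameter attached to cup a in the likelihood written above.

x = ['b', 'b', 'b', 'a', 'b', 'b', 'b', 'b', 'b', 'a']

n_a = x.count('a')            # exponent of p in the likelihood
n_b = x.count('b')            # exponent of (1 - p)
p_hat = n_a / (n_a + n_b)     # maximizer of p**n_a * (1 - p)**n_b
print(p_hat)                  # prints 0.2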
While this model only makes use of an approximation of the true physical process
at work when the marble interacts with the game board, it is an empirical question
whether the model works well enough in practice to be useful. Additionally, while a
Bernoulli trial is an extreme approximation of the physical process, if insufficient re-
sources were invested in building a physical model, the approximation may perform
better than the more complicated “rule-based” model. This sort of dynamic is found
often in text processing problems: given enough data, astonishingly simple models can
outperform complex knowledge-intensive models that attempt to simulate complicated
processes.
6.1.2 A LATENT VARIABLE MARBLE GAME
To see where latent variables might come into play in modeling, consider a more com-
plicated variant of our marble game shown in Figure 6.2. This version consists of three
pegs that influence the marble’s path, and the marble may end up in one of three cups.
Note that both paths 1 and 2 lead to cup b.
To construct a statistical model of this game, we again assume that the behavior of
a marble interacting with a peg can be modeled with a Bernoulli random variable. Since
there are three pegs, we have three random variables with parameters θ = ⟨p_0, p_1, p_2⟩,
corresponding to the probabilities that the marble will go to the right at the top,
left, and right pegs. We further define a random variable X taking on values from
{a, b, c} indicating what cup the marble ends in, and Y, taking on values from {0, 1, 2, 3}
indicating which path was taken. Note that the full joint distribution Pr(X = x, Y = y)
is determined by θ.
How should the parameters θ be estimated? If it were possible to observe the
paths taken by marbles as they were dropped into the game, it would be trivial to
estimate the parameters for our model using the maximum likelihood estimator—we
would simply need to count the number of times the marble bounced left or right at
each peg. If N_x counts the number of times a marble took path x in N trials, this is:
Figure 6.2: A more complicated marble game where the released marble takes one of four
possible paths. We assume that we can only observe which cup the marble ends up in, not the
specific path taken.
p_0 = (N_2 + N_3) / N        p_1 = N_1 / (N_0 + N_1)        p_2 = N_3 / (N_2 + N_3)
However, we wish to consider the case where the paths taken are unobservable (imagine
an opaque sheet covering the center of the game board), but where we can see what cup a
marble ends in. In other words, we want to consider the case where we have partial data.
This is exactly the problem encountered in unsupervised learning: there is a statistical
model describing the relationship between two sets of variables (X’s and Y ’s), and there
is data available from just one of them. Furthermore, such algorithms are quite useful in
text processing, where latent variables may describe latent linguistic structures of the
observed variables, such as parse trees or part-of-speech tags, or alignment structures
relating sets of observed variables (see Section 6.4).
6.1.3 MLE WITH LATENT VARIABLES
Formally, we consider the problem of estimating parameters for statistical models of
the form Pr(X, Y ; θ) which describe not only an observable variable X but a latent, or
hidden, variable Y .
In these models, since only the values of the random variable X are observable,
we define our optimization criterion to be the maximization of the marginal likelihood,
that is, summing over all settings of the latent variable Y, which takes on values from
the set designated Y.⁴ Again, we assume that samples in the training data x are i.i.d.:
⁴ For this description, we assume that the variables in our model take on discrete values. Not only does this simplify exposition, but discrete models are widely used in text processing.
Pr(X = x) = ∑_{y ∈ Y} Pr(X = x, Y = y; θ)
For a vector of training observations x = ⟨x_1, x_2, . . . , x_ℓ⟩, if we assume the samples are
i.i.d.:
Pr(x; θ) = ∏_{j=1}^{|x|} ∑_{y ∈ Y} Pr(X = x_j, Y = y; θ)
Thus, the maximum (marginal) likelihood estimate of the model parameters θ* given a
vector of i.i.d. observations x becomes:

θ* = arg max_θ ∏_{j=1}^{|x|} ∑_{y ∈ Y} Pr(X = x_j, Y = y; θ)
Unfortunately, in many cases, this maximum cannot be computed analytically, but the
iterative hill-climbing approach of expectation maximization can be used instead.
6.1.4 EXPECTATION MAXIMIZATION
Expectation maximization (EM) is an iterative algorithm that finds a successive series
of parameter estimates θ^(0), θ^(1), . . . that improve the marginal likelihood of the training
data. That is, EM guarantees:
∏_{j=1}^{|x|} ∑_{y ∈ Y} Pr(X = x_j, Y = y; θ^(i+1)) ≥ ∏_{j=1}^{|x|} ∑_{y ∈ Y} Pr(X = x_j, Y = y; θ^(i))
The algorithm starts with some initial set of parameters θ^(0) and then updates them
using two steps: expectation (E-step), which computes the posterior distribution over
the latent variables given the observable data x and a set of parameters θ^(i),⁵ and
maximization (M-step), which computes new parameters θ^(i+1) maximizing the expected
log likelihood of the joint distribution with respect to the distribution computed in the
E-step. The process then repeats with these new parameters. The algorithm terminates
when the likelihood remains unchanged.⁶ In more detail, the steps are as follows:
⁵ The term ‘expectation’ is used since the values computed in terms of the posterior distribution Pr(y|x; θ^(i)) that are required to solve the M-step have the form of an expectation (with respect to this distribution).
⁶ The final solution is only guaranteed to be a local maximum, but if the model is fully convex, it will also be the global maximum.
E-step. Compute the posterior probability of each possible hidden variable assignment
y ∈ Y for each x ∈ X and the current parameter settings, weighted by the relative
frequency with which x occurs in x. Call this q(X = x, Y = y; θ^(i)) and note that
it defines a joint probability distribution over X × Y in that ∑_{(x,y) ∈ X×Y} q(x, y) = 1.

q(x, y; θ^(i)) = f(x|x) Pr(Y = y|X = x; θ^(i)) = f(x|x) Pr(x, y; θ^(i)) / ∑_{y′} Pr(x, y′; θ^(i))
M-step. Compute new parameter settings that maximize the expected log of the
probability of the joint distribution under the q-distribution that was computed in the
E-step:

θ^(i+1) = arg max_{θ′} E_{q(X=x, Y=y; θ^(i))} [log Pr(X = x, Y = y; θ′)]
        = arg max_{θ′} ∑_{(x,y) ∈ X×Y} q(X = x, Y = y; θ^(i)) log Pr(X = x, Y = y; θ′)
We omit the proof that the model with parameters θ^(i+1) will have equal or greater
marginal likelihood on the training data than the model with parameters θ^(i), but this
is provably true [78].
Before continuing, we note that the effective application of expectation maximiza-
tion requires that both the E-step and the M-step consist of tractable computations.
Specifically, summing over the space of hidden variable assignments must not be in-
tractable. Depending on the independence assumptions made in the model, this may
be achieved through techniques such as dynamic programming. However, some models
may require intractable computations.
6.1.5 AN EM EXAMPLE
Let’s look at how to estimate the parameters from our latent variable marble game from
Section 6.1.2 using EM. We assume training data x consisting of N = |x| observations of
X with N_a, N_b, and N_c indicating the number of marbles ending in cups a, b, and c. We
start with some parameters θ^(0) = ⟨p_0^(0), p_1^(0), p_2^(0)⟩ that have been randomly initialized
to values between 0 and 1.
E-step. We need to compute the distribution q(X = x, Y = y; θ^(i)), as defined above.
We first note that the relative frequency f(x|x) is:

f(x|x) = N_x / N
Next, we observe that Pr(Y = 0|X = a) = 1 and Pr(Y = 3|X = c) = 1 since cups a and
c fully determine the value of the path variable Y. The posterior probabilities of paths 1
and 2 are only non-zero when X is b:

Pr(1|b; θ^(i)) = (1 − p_0^(i)) p_1^(i) / [(1 − p_0^(i)) p_1^(i) + p_0^(i) (1 − p_2^(i))]

Pr(2|b; θ^(i)) = p_0^(i) (1 − p_2^(i)) / [(1 − p_0^(i)) p_1^(i) + p_0^(i) (1 − p_2^(i))]
Except for the four cases just described, Pr(Y = y|X = x) is zero for all other values of
x and y (regardless of the value of the parameters).
M-step. We now need to maximize the expectation of log Pr(X, Y; θ′) (which will
be a function in terms of the three parameter variables) under the q-distribution we
computed in the E step. The non-zero terms in the expectation are as follows:

x   y   q(X = x, Y = y; θ^(i))         log Pr(X = x, Y = y; θ′)
a   0   N_a/N                          log(1 − p′_0) + log(1 − p′_1)
b   1   N_b/N Pr(1|b; θ^(i))           log(1 − p′_0) + log p′_1
b   2   N_b/N Pr(2|b; θ^(i))           log p′_0 + log(1 − p′_2)
c   3   N_c/N                          log p′_0 + log p′_2
Multiplying across each row and adding from top to bottom yields the expectation we
wish to maximize. Each parameter can be optimized independently using differentiation.
The resulting optimal values are expressed in terms of the counts in x and θ^(i):
p_0 = [Pr(2|b; θ^(i)) N_b + N_c] / N
p_1 = Pr(1|b; θ^(i)) N_b / [N_a + Pr(1|b; θ^(i)) N_b]
p_2 = N_c / [Pr(2|b; θ^(i)) N_b + N_c]
It is worth noting that the form of these expressions is quite similar to the fully observed
maximum likelihood estimate. However, rather than depending on exact path counts,
the statistics used are the expected path counts, given x and parameters θ^(i).
Typically, the values computed at the end of the M-step would serve as new
parameters for another iteration of EM. However, the example we have presented here
is quite simple and the model converges to a global optimum after a single iteration. For
most models, EM requires several iterations to converge, and it may not find a global
optimum. And since EM only finds a locally optimal solution, the final parameter values
depend on the values chosen for θ^(0).
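To make the E-step and M-step above concrete, here is a small Python sketch of EM for this game. It assumes that only the cup counts N_a, N_b, and N_c are observed, applies the update formulas just derived, and uses a fixed number of iterations as a stand-in for a convergence test; the function name and stopping rule are our own choices.

import random

def em_marble_game(n_a, n_b, n_c, iterations=20):
    n = n_a + n_b + n_c
    # Random initialization of theta^(0) = <p0, p1, p2>, as in the text.
    p0, p1, p2 = (random.random() for _ in range(3))
    for _ in range(iterations):
        # E-step: posterior probabilities of the two hidden paths that end in cup b.
        z = (1 - p0) * p1 + p0 * (1 - p2)
        pr1_b = (1 - p0) * p1 / z          # Pr(1 | b)
        pr2_b = p0 * (1 - p2) / z          # Pr(2 | b)
        # M-step: re-estimate the parameters from expected path counts.
        p0 = (pr2_b * n_b + n_c) / n
        p1 = (pr1_b * n_b) / (n_a + pr1_b * n_b)
        p2 = n_c / (pr2_b * n_b + n_c)
    return p0, p1, p2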
6.2 HIDDEN MARKOV MODELS
To give a more substantial and useful example of models whose parameters may be
estimated using EM, we turn to hidden Markov models (HMMs). HMMs are models of
data that are ordered sequentially (temporally, from left to right, etc.), such as words
in a sentence, base pairs in a gene, or letters in a word. These simple but powerful
models have been used in applications as diverse as speech recognition [78], information
extraction [139], gene finding [143], part of speech tagging [44], stock market forecasting
[70], text retrieval [108], and word alignment of parallel (translated) texts [150] (more
in Section 6.4).
In an HMM, the data being modeled is posited to have been generated from an
underlying Markov process, which is a stochastic process consisting of a finite set of
states where the probability of entering a state at time t + 1 depends only on the state
of the process at time t [130]. Alternatively, one can view a Markov process as a prob-
abilistic variant of a finite state machine, where transitions are taken probabilistically.
As another point of comparison, the PageRank algorithm considered in the previous
chapter (Section 5.3) can be understood as a Markov process: the probability of follow-
ing any link on a particular page is independent of the path taken to reach that page.
The states of this Markov process are, however, not directly observable (i.e., hidden).
Instead, at each time step, an observable token (e.g., a word, base pair, or letter) is
emitted according to a probability distribution conditioned on the identity of the state
that the underlying process is in.
A hidden Markov model M is defined as a tuple ⟨S, O, θ⟩. S is a finite set of states,
which generate symbols from a finite observation vocabulary O. Following convention,
we assume that variables q, r, and s refer to states in S, and o refers to symbols in
the observation vocabulary O. This model is parameterized by the tuple θ = ⟨A, B, π⟩
consisting of an |S| × |S| matrix A of transition probabilities, where A_q(r) gives the
probability of transitioning from state q to state r; an |S| × |O| matrix B of emission
probabilities, where B_q(o) gives the probability that symbol o will be emitted from
state q; and an |S|-dimensional vector π, where π_q is the probability that the process
starts in state q.⁷ These matrices may be dense, but for many applications sparse
parameterizations are useful. We further stipulate that A_q(r) ≥ 0, B_q(o) ≥ 0, and π_q ≥ 0
for all q, r, and o, as well as that:

∑_{r ∈ S} A_q(r) = 1  ∀q        ∑_{o ∈ O} B_q(o) = 1  ∀q        ∑_{q ∈ S} π_q = 1
A sequence of observations of length τ is generated as follows:
Step 0, let t = 1 and select an initial state q according to the distribution π.
Step 1, an observation symbol from O is emitted according to the distribution B_q.
⁷ This is only one possible definition of an HMM, but it is one that is useful for many text processing problems. In alternative definitions, initial and final states may be handled differently, observations may be emitted during the transition between states, or continuous-valued observations may be emitted (for example, from a Gaussian distribution).
Step 2, a new q is drawn according to the distribution A_q.
Step 3, t is incremented, and if t ≤ τ, the process repeats from Step 1.
Since all events generated by this process are conditionally independent, the joint prob-
ability of this sequence of observations and the state sequence used to generate it is the
product of the individual event probabilities.
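The generative procedure in Steps 0 through 3 translates directly into code. Below is a minimal Python sketch that draws a state sequence and observation sequence of length τ from an HMM whose parameters A, B, and π are given as dictionaries of outcome-to-probability mappings; this representation and the helper names are our own, chosen only for readability.

import random

def draw(distribution):
    # Sample an outcome from a dict mapping outcomes to probabilities.
    r, total = random.random(), 0.0
    for outcome, prob in distribution.items():
        total += prob
        if r < total:
            return outcome
    return outcome                       # guard against floating-point round-off

def generate(A, B, pi, tau):
    states, observations = [], []
    q = draw(pi)                         # Step 0: choose an initial state
    for _ in range(tau):
        observations.append(draw(B[q]))  # Step 1: emit a symbol from state q
        states.append(q)
        q = draw(A[q])                   # Step 2: transition to a new state
    return states, observations          # Step 3: repeat until t exceeds tau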
Figure 6.3 shows a simple example of a hidden Markov model for part-of-
speech tagging, which is the task of assigning to each word in an input sentence
its grammatical category (one of the first steps in analyzing textual content). States
S = {det, adj, nn, v} correspond to the parts of speech (determiner, adjective, noun,
and verb), and observations O = {the, a, green, . . .} are a subset of English words. This
example illustrates a key intuition behind many applications of HMMs: states corre-
spond to equivalence classes or clusterings of observations, and a single observation type
may be associated with several clusters (in this example, the word wash can be generated
by an nn or v, since wash can either be a noun or a verb).
6.2.1 THREE QUESTIONS FOR HIDDEN MARKOV MODELS
There are three fundamental questions associated with hidden Markov models:⁸
1. Given a model M = ⟨S, O, θ⟩, and an observation sequence of symbols from O,
x = ⟨x_1, x_2, . . . , x_τ⟩, what is the probability that M generated the data (summing
over all possible state sequences, Y)?

   Pr(x) = ∑_{y ∈ Y} Pr(x, y; θ)
2. Given a model M = ⟨S, O, θ⟩ and an observation sequence x, what is the most
likely sequence of states that generated the data?

   y* = arg max_{y ∈ Y} Pr(x, y; θ)
3. Given a set of states S, an observation vocabulary O, and a series of i.i.d.
observation sequences ⟨x_1, x_2, . . . , x_ℓ⟩, what are the parameters θ = ⟨A, B, π⟩ that
maximize the likelihood of the training data?

   θ* = arg max_θ ∏_{i=1}^{ℓ} ∑_{y ∈ Y} Pr(x_i, y; θ)
Using our definition of an HMM, the answers to the first two questions are in principle
quite trivial to compute: by iterating over all state sequences Y, the probability that
⁸ The organization of this section is based in part on ideas from Lawrence Rabiner’s HMM tutorial [125].
[Figure 6.3 graphic: tables of transition probabilities, emission probabilities, and initial probabilities over the states DET, ADJ, NN, and V, together with example outputs such as “John might wash” tagged NN V V and “the big green person loves old plants” tagged DET ADJ ADJ NN V ADJ NN.]
Figure 6.3: An example HMM that relates part-of-speech tags to vocabulary items in an
English-like language. Possible (probability > 0) transitions for the Markov process are shown
graphically. In the example outputs, the state sequences corresponding to the emissions are
written beneath the emitted symbols.
each generated x can be computed by looking up and multiplying the relevant prob-
abilities in A, B, and π, and then summing the result or taking the maximum. And,
as we hinted at in the previous section, the third question can be answered using EM.
Unfortunately, even with all the distributed computing power MapReduce makes avail-
able, we will quickly run into trouble if we try to use this naïve strategy since there are
|S|^τ distinct state sequences of length τ, making exhaustive enumeration computation-
ally intractable. Fortunately, because the underlying model behaves exactly the same
whenever it is in some state, regardless of how it got to that state, we can use dynamic
programming algorithms to answer all of the above questions without summing over
exponentially many sequences.
6.2.2 THE FORWARD ALGORITHM
Given some observation sequence, for example x = ⟨John, might, wash⟩, Question 1 asks
what is the probability that this sequence was generated by an HMM M = ⟨S, O, θ⟩.
For the purposes of illustration, we assume that M is defined as shown in Figure 6.3.
There are two ways to compute the probability of x having been generated by
M. The first is to compute the sum over the joint probability of x and every possible
labeling y′ ∈ {⟨det, det, det⟩, ⟨det, det, nn⟩, ⟨det, det, v⟩, . . .}. As indicated above
this is not feasible for most sequences, since the set of possible labels is exponential in
the length of x. The second, fortunately, is much more efficient.
We can make use of what is known as the forward algorithm to compute the
desired probability in polynomial time. We assume a model M = ⟨S, O, θ⟩ as defined
above. This algorithm works by recursively computing the answer to a related question:
what is the probability that the process is in state q at time t and has generated
⟨x_1, x_2, . . . , x_t⟩? Call this probability α_t(q). Thus, α_t(q) is a two dimensional matrix (of
size |x| × |S|), called a trellis. It is easy to see that the values of α_1(q) can be computed
as the product of two independent probabilities: the probability of starting in state q
and the probability of state q generating x_1:

α_1(q) = π_q B_q(x_1)
From this, it’s not hard to see that the values of α_2(r) for every r can be computed in
terms of the |S| values in α_1(·) and the observation x_2:

α_2(r) = B_r(x_2) ∑_{q ∈ S} α_1(q) A_q(r)
This works because there are |S| different ways to get to state r at time t = 2: starting
from state 1, 2, . . . , |S| and transitioning to state r. Furthermore, because the behavior
of a Markov process is determined only by the state it is in at some time (not by how
it got to that state), α_t(r) can always be computed in terms of the |S| values in α_{t−1}(·)
and the observation x_t:

α_t(r) = B_r(x_t) ∑_{q ∈ S} α_{t−1}(q) A_q(r)
We have now shown how to compute the probability of being in any state q at any time
t, having generated ⟨x_1, x_2, . . . , x_t⟩, with the forward algorithm. The probability of the
full sequence is the probability of being at time |x| in any state, so the answer to
Question 1 can be computed simply by summing over α values at time |x| for all states:

Pr(x; θ) = ∑_{q ∈ S} α_{|x|}(q)
In summary, there are two ways of computing the probability that a sequence of obser-
vations x was generated by M: exhaustive enumeration with summing and the forward
algorithm. Figure 6.4 illustrates the two possibilities. The upper panel shows the naïve
exhaustive approach, enumerating all 4³ possible labels y′ of x and computing their joint
probability Pr(x, y′). Summing over all y′, the marginal probability of x is found to be
0.00018. The lower panel shows the forward trellis, consisting of 4 × 3 cells. Summing
over the final column also yields 0.00018, the same result.
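The forward recurrence translates almost line for line into code. The following Python sketch uses dictionary-of-dictionaries parameters (our own representation, not anything prescribed by the text); applied to the model of Figure 6.3 and x = ⟨John, might, wash⟩, it would compute the trellis whose final column sums to 0.00018.

def forward(A, B, pi, x):
    # Compute Pr(x) by summing the final column of the forward trellis.
    states = list(pi)
    # Base case: alpha_1(q) = pi_q * B_q(x_1)
    alpha = [{q: pi[q] * B[q].get(x[0], 0.0) for q in states}]
    # Recurrence: alpha_t(r) = B_r(x_t) * sum_q alpha_{t-1}(q) * A_q(r)
    for t in range(1, len(x)):
        prev = alpha[-1]
        alpha.append({r: B[r].get(x[t], 0.0) *
                         sum(prev[q] * A[q].get(r, 0.0) for q in states)
                      for r in states})
    # Pr(x; theta) = sum over states of alpha_|x|(q)
    return sum(alpha[-1].values())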
6.2.3 THE VITERBI ALGORITHM
Given an observation sequence x, the second question we might want to ask of M is:
what is the most likely sequence of states that generated the observations? As with the
previous question, the naïve approach to solving this problem is to enumerate all possible
labels and find the one with the highest joint probability. Continuing with the example
observation sequence x = ⟨John, might, wash⟩, examining the chart of probabilities in
the upper panel of Figure 6.4 shows that y* = ⟨nn, v, v⟩ is the most likely sequence of
states under our example HMM.
However, a more efficient answer to Question 2 can be computed using the same
intuition as in the forward algorithm: determine the best state sequence for a short se-
quence and extend this to easily compute the best sequence for longer ones. This is
known as the Viterbi algorithm. We define γ_t(q), the Viterbi probability, to be the prob-
ability of the most likely sequence of states ending in state q at time t and generating
observations ⟨x_1, x_2, . . . , x_t⟩. Since we wish to be able to reconstruct the sequence of
states, we define bp_t(q), the “backpointer”, to be the state used in this sequence at time
t − 1. The base case for the recursion is as follows (the state index of −1 is used as a
placeholder since there is no previous best state at time t = 1):
[Figure 6.4 graphic: the upper panel enumerates all 4³ labelings of ⟨John, might, wash⟩ together with their joint probabilities (the non-zero entries, e.g., 0.00009, 0.00006, 0.000021, and 0.000009, all begin with NN); the lower panel shows the 4 × 3 forward trellis, whose final column (0.000081 and 0.000099) sums to 0.00018.]
Figure 6.4: Computing the probability of the sequence ⟨John, might, wash⟩ under the HMM
given in Figure 6.3 by explicitly summing over all possible sequence labels (upper panel) and
using the forward algorithm (lower panel).
Figure 6.5: Computing the most likely state sequence that generated ⟨John, might, wash⟩ under
the HMM given in Figure 6.3 using the Viterbi algorithm. The most likely state sequence
is highlighted in bold and could be recovered programmatically by following backpointers from
the maximal probability cell in the last column to the first column (thicker arrows).
γ_1(q) = π_q · B_q(x_1)
bp_1(q) = −1
The recursion is similar to that of the forward algorithm, except rather than summing
over previous states, the maximum value of all possible trajectories into state r at time
t is computed. Note that the backpointer simply records the index of the originating
state—a separate computation is not necessary.
γ_t(r) = max_{q∈S} γ_{t−1}(q) · A_q(r) · B_r(x_t)
bp_t(r) = arg max_{q∈S} γ_{t−1}(q) · A_q(r) · B_r(x_t)
To compute the best sequence of states, y*, the state with the highest probability path
at time |x| is selected, and then the backpointers are followed, recursively, to construct
the rest of the sequence:

y*_{|x|} = arg max_{q∈S} γ_{|x|}(q)
y*_{t−1} = bp_t(y*_t)
Figure 6.5 illustrates a Viterbi trellis, including backpointers that have been used to
compute the most likely state sequence.
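As an illustration, here is a minimal Python sketch of the Viterbi recursion just described. The dictionaries pi, A, and B are placeholders for the initial, transition, and emission distributions of some HMM (the same shapes as in the earlier forward sketch), not the actual parameters of Figure 6.3.

def viterbi(x, states, pi, A, B):
    """Return the most probable state sequence for observations x."""
    # Base case: gamma_1(q) = pi_q * B_q(x_1); bp_1(q) is a placeholder.
    gamma = [{q: pi[q] * B[q][x[0]] for q in states}]
    bp = [{q: None for q in states}]
    # Recursion: maximize (rather than sum) over predecessor states.
    for t in range(1, len(x)):
        gamma.append({})
        bp.append({})
        for r in states:
            best_q = max(states, key=lambda q: gamma[t - 1][q] * A[q][r])
            gamma[t][r] = gamma[t - 1][best_q] * A[best_q][r] * B[r][x[t]]
            bp[t][r] = best_q
    # Termination: pick the best final state, then follow backpointers.
    y = [max(states, key=lambda q: gamma[-1][q])]
    for t in range(len(x) - 1, 0, -1):
        y.append(bp[t][y[-1]])
    return list(reversed(y))

Run with the parameters of the example HMM of Figure 6.3, this procedure would recover y* = ⟨nn, v, v⟩ for x = ⟨John, might, wash⟩, matching the bold path in Figure 6.5.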
Figure 6.6: A “fully observable” HMM training instance. The output sequence is at the top
of the figure, and the corresponding states and transitions are shown in the trellis below.
6.2.4 PARAMETER ESTIMATION FOR HMMS
We now turn to Question 3: given a set of states S and an observation vocabulary O,
what are the parameters θ* = ⟨A, B, π⟩ that maximize the likelihood of a set of training
examples, ⟨x^1, x^2, . . . , x^ℓ⟩?⁹
Since our model is constructed in terms of variables
whose values we cannot observe (the state sequence) in the training data, we may train
it to optimize the marginal likelihood (summing over all state sequences) of x using
EM. Deriving the EM update equations requires only the application of the techniques
presented earlier in this chapter and some differential calculus. However, since the for-
malism is cumbersome, we will skip a detailed derivation, but readers interested in more
information can find it in the relevant citations [78, 125].
In order to make the update equations as intuitive as possible, consider a fully
observable HMM, that is, one where both the emissions and the state sequence are
observable in all training instances. In this case, a training instance can be depicted
as shown in Figure 6.6. When this is the case, such as when we have a corpus of sentences
in which all words have already been tagged with their parts of speech, the maximum
likelihood estimate for the parameters can be computed in terms of the counts of the
number of times the process transitions from state q to state r in all training instances,
T(q → r); the number of times that state q emits symbol o, O(q ↑ o); and the number
of times the process starts in state q, I(q). In this example, the process starts in state
nn; there is one nn → v transition and one v → v transition. The nn state emits John
in the first time step, and the v state emits might and wash in the second and third time
steps, respectively. We also define N(q) to be the number of times the process enters
state q. The maximum likelihood estimates of the parameters in the fully observable
case are:
9
Since an HMM models sequences, its training data consists of a collection of example sequences.
π_q = I(q) / Σ_r I(r)
A_q(r) = T(q → r) / N(q) = T(q → r) / Σ_{r′} T(q → r′)
B_q(o) = O(q ↑ o) / N(q) = O(q ↑ o) / Σ_{o′} O(q ↑ o′)        (6.2)
For example, to compute the emission parameters from state nn, we simply need to keep
track of the number of times the process is in state nn and what symbol it generates
at each of these times. Transition probabilities are computed similarly: to compute, for
example, the distribution A_det(·), that is, the probabilities of transitioning away from
state det, we count the number of times the process is in state det, and keep track
of what state the process transitioned into at the next time step. This counting and
normalizing can be accomplished using the exact same counting and relative frequency
algorithms that we described in Section 3.3. Thus, in the fully observable case, parameter
estimation is not a new algorithm at all, but one we have seen before.
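As a concrete, if simplified, illustration of this count-and-normalize pattern, the Python sketch below computes the maximum likelihood estimates from a hypothetical corpus of tagged sentences, each represented as a list of (word, tag) pairs. It is an illustrative rendering of the relative frequency idea of Section 3.3, not code from that section.

from collections import defaultdict

def mle_estimate(tagged_sentences):
    """Relative-frequency estimates for a fully observable HMM."""
    I = defaultdict(float)                        # initial state counts I(q)
    T = defaultdict(lambda: defaultdict(float))   # transition counts T(q -> r)
    O = defaultdict(lambda: defaultdict(float))   # emission counts O(q emits o)
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        I[tags[0]] += 1
        for word, tag in sent:
            O[tag][word] += 1
        for q, r in zip(tags, tags[1:]):
            T[q][r] += 1

    def normalize(d):
        total = sum(d.values())
        return {k: v / total for k, v in d.items()}

    pi = normalize(I)
    A = {q: normalize(T[q]) for q in T}
    B = {q: normalize(O[q]) for q in O}
    return pi, A, B

# e.g., mle_estimate([[('John', 'NN'), ('might', 'V'), ('wash', 'V')]])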
How should the model parameters be estimated when the state sequence is not
provided? It turns out that the update equations have a satisfying form: the optimal
parameter values for iteration i + 1 are expressed in terms of the expectations of
the counts referenced in the fully observed case, taken according to the posterior
distribution over the latent variables given the observations x and the parameters θ^(i):
π_q = E[I(q)] / Σ_r E[I(r)]
A_q(r) = E[T(q → r)] / E[N(q)]
B_q(o) = E[O(q ↑ o)] / E[N(q)]        (6.3)
Because of the independence assumptions made in the HMM, the update equations
consist of 2 |S| + 1 independent optimization problems, just as was the case with the
‘observable’ HMM. Solving for the initial state distribution, π, is one problem; there
are |S| problems solving for the transition distributions A_q(·) from each state q; and
|S| problems solving for the emission distributions B_q(·) from each state q. Furthermore,
we note that the following must hold:
E[N(q)] = Σ_{r∈S} E[T(q → r)] = Σ_{o∈O} E[O(q ↑ o)]
As a result, the optimization problems (i.e., Equations 6.2) require completely indepen-
dent sets of statistics, which we will utilize later to facilitate efficient parallelization in
MapReduce.
How can the expectations in Equation 6.3 be understood? In the fully observed
training case, between every time step, there is exactly one transition taken and the
source and destination states are observable. By progressing through the Markov chain,
we can let each transition count as ‘1’, and we can accumulate the total number of times
each kind of transition was taken (by each kind, we simply mean the number of times
that one state follows another, for example, the number of times nn follows det). These
statistics can then in turn be used to compute the MLE for an ‘observable’ HMM, as
described above. However, when the transition sequence is not observable (as is most
often the case), we can instead imagine that at each time step, every possible transition
(there are |S|² of them, and typically |S| is quite small) is taken, with a particular
probability. The probability used is the posterior probability of the transition, given the
model and an observation sequence (we describe how to compute this value below).
By summing over all the time steps in the training data, and using this probability as
the ‘count’ (rather than ‘1’ as in the observable case), we compute the expected count
of the number of times a particular transition was taken, given the training sequence.
Furthermore, since the training instances are statistically independent, the value of the
expectations can be computed by processing each training instance independently and
summing the results.
Similarly, for the necessary emission counts (the number of times each symbol in
O was generated by each state in S), we assume that any state could have generated
the observation. We must therefore compute the probability of being in every state at
each time point, which then serves as the size of the emission ‘count’. By summing over all
time steps we compute the expected count of the number of times that a particular
state generated a particular symbol. These two sets of expectations, which are written
formally here, are sufficient to execute the M-step.
E[O(q ↑ o)] = Σ_{i=1}^{|x|} Pr(y_i = q | x; θ) · δ(x_i, o)        (6.4)
E[T(q → r)] = Σ_{i=1}^{|x|−1} Pr(y_i = q, y_{i+1} = r | x; θ)        (6.5)
Posterior probabilities. The expectations necessary for computing the M-step in
HMM training are sums of probabilities that a particular transition is taken, given an
observation sequence, and that some state emits some observation symbol, given an
observation sequence. These are referred to as posterior probabilities, indicating that
they are the probability of some event whose distribution we have a prior belief about,
after additional evidence has been taken into consideration (here, the model parameters
characterize our prior beliefs, and the observation sequence is the evidence). Both
posterior probabilities can be computed by combining the forward probabilities, α_t(·),
which give the probability of reaching some state at time t, by any path, and generating
the observations ⟨x_1, x_2, . . . , x_t⟩, with backward probabilities, β_t(·), which give the
probability of starting in some state at time t and generating the rest of the sequence
⟨x_{t+1}, x_{t+2}, . . . , x_{|x|}⟩, using any sequence of states to do so. The algorithm for
computing the backward probabilities is given a bit later. Once the forward and backward
probabilities have been computed, the state transition posterior probabilities and the
emission posterior probabilities can be written as follows:
Figure 6.7: Using forward and backward probabilities to compute the posterior probability of
the dashed transition, given the observation sequence a b b c b. The shaded area on the left
corresponds to the forward probability α_2(s_2), and the shaded area on the right corresponds to
the backward probability β_3(s_2).
Pr(y_i = q | x; θ) = α_i(q) · β_i(q)        (6.6)
Pr(y_i = q, y_{i+1} = r | x; θ) = α_i(q) · A_q(r) · B_r(x_{i+1}) · β_{i+1}(r)        (6.7)
Equation 6.6 is the probability of being in state q at time i, given x, and the correctness
of the expression should be clear from the definitions of forward and backward
probabilities. The intuition for Equation 6.7, the probability of taking a particular transition
at a particular time, is also not complicated: it is the product of four conditionally
independent probabilities: the probability of getting to state q at time i (having
generated the first part of the sequence), the probability of taking transition q → r (which
is specified in the parameters, θ), the probability of generating observation x_{i+1} from
state r (also specified in θ), and the probability of generating the rest of the sequence,
along any path. A visualization of the quantities used in computing this probability is
shown in Figure 6.7. In this illustration, we assume an HMM with S = {s_1, s_2, s_3} and
O = {a, b, c}.
The backward algorithm. Like the forward and Viterbi algorithms introduced
above to answer Questions 1 and 2, the backward algorithm uses dynamic programming
to incrementally compute β_t(·). Its base case starts at time |x|, and is defined as
follows:
β_{|x|}(q) = 1
To understand the intuition for this base case, keep in mind that since the backward
probabilities β_t(·) are the probability of generating the remainder of the sequence after
time t (as well as being in some state), and since there is nothing left to generate after
time |x|, the probability must be 1. The recursion is defined as follows:
β_t(q) = Σ_{r∈S} β_{t+1}(r) · A_q(r) · B_r(x_{t+1})
Unlike the forward and Viterbi algorithms, the backward algorithm is computed from
right to left and makes no reference to the start probabilities, π.
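Continuing the Python sketches above, the backward probabilities and the expectations of Equations 6.4 and 6.5 for a single training sequence can be computed as follows. A forward(x, states, pi, A, B) helper returning the full table of α values is assumed (the earlier sketch returned only their sum). Following Equations 6.4 and 6.5 exactly, the α·β products are accumulated directly; in practice one typically also divides by the sequence marginal Pr(x; θ) so that expectations from different training instances are weighted comparably.

def backward(x, states, A, B):
    """beta_t(q): probability of generating x_{t+1}..x_{|x|} from state q at time t."""
    beta = [{q: 1.0 for q in states}]                     # base case at time |x|
    for t in range(len(x) - 2, -1, -1):
        beta.insert(0, {q: sum(beta[0][r] * A[q][r] * B[r][x[t + 1]]
                               for r in states) for q in states})
    return beta

def expected_counts(x, states, pi, A, B):
    """E-step statistics for one sequence (Equations 6.4 and 6.5)."""
    alpha = forward(x, states, pi, A, B)                  # hypothetical helper (alpha table)
    beta = backward(x, states, A, B)
    E_emit = {q: {} for q in states}                      # E[O(q emits o)]
    E_trans = {q: {r: 0.0 for r in states} for q in states}   # E[T(q -> r)]
    for i, o in enumerate(x):
        for q in states:
            E_emit[q][o] = E_emit[q].get(o, 0.0) + alpha[i][q] * beta[i][q]
    for i in range(len(x) - 1):
        for q in states:
            for r in states:
                E_trans[q][r] += alpha[i][q] * A[q][r] * B[r][x[i + 1]] * beta[i + 1][r]
    return E_emit, E_trans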
6.2.5 FORWARD-BACKWARD TRAINING: SUMMARY
In the preceding section, we have shown how to compute all quantities needed to find
the parameter settings θ^(i+1) using EM training with a hidden Markov model M =
⟨S, O, θ^(i)⟩. To recap: each training instance x is processed independently, using the
parameter settings of the current iteration, θ^(i). For each x in the training data, the
forward and backward probabilities are computed using the algorithms given above
(for this reason, this training algorithm is often referred to as the forward-backward
algorithm). The forward and backward probabilities are in turn used to compute the
expected number of times the underlying Markov process enters into each state, the
number of times each state generates each output symbol type, and the number of times
each state transitions into each other state. These expectations are summed over all
training instances, completing the E-step. The M-step involves normalizing the expected
counts computed in the E-step using the calculations in Equation 6.3, which yields θ^(i+1).
The process then repeats from the E-step using the new parameters. The number of
iterations required for convergence depends on the quality of the initial parameters,
and the complexity of the model. For some applications, only a handful of iterations
are necessary, whereas for others, hundreds may be required.
Finally, a few practical considerations. HMMs have a non-convex likelihood surface
(that is, a surface with many hills and valleys in a space whose dimensionality equals
the number of parameters in the model). As a result, EM training is
only guaranteed to find a local maximum, and the quality of the learned model may vary
considerably, depending on the initial parameters that are used. Strategies for optimal
selection of initial parameters depend on the phenomena being modeled. Additionally,
if some parameter is assigned a probability of 0 (either as an initial value or during
one of the M-step parameter updates), EM will never change this in future iterations.
This can be useful, since it provides a way of constraining the structures of the Markov
model; however, one must be aware of this behavior.
Another pitfall to avoid when implementing HMMs is arithmetic underflow.
HMMs typically define a massive number of sequences, and so the probability of any one
of them is often vanishingly small, so small that it underflows standard floating
point representations. A very common solution to this problem is to represent prob-
abilities using their logarithms. Note that expected counts do not typically have this
problem and can be represented using normal floating point numbers. See Section 5.4
for additional discussion on working with log probabilities.
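As a small generic illustration of the log-space idea (a sketch, not code from Section 5.4): products of probabilities become sums of log probabilities, and the sums needed by the forward and backward recursions can be computed with the standard log-sum-exp trick.

import math

def log_add(log_a, log_b):
    """Compute log(a + b) given log(a) and log(b), avoiding underflow."""
    if log_a == float('-inf'):
        return log_b
    if log_b == float('-inf'):
        return log_a
    m = max(log_a, log_b)
    return m + math.log(math.exp(log_a - m) + math.exp(log_b - m))

# Multiplying probabilities: add their logs.
# p = 1e-200 * 1e-200 underflows to 0.0, but
# log_p = math.log(1e-200) + math.log(1e-200) does not.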
6.3 EM IN MAPREDUCE
Expectation maximization algorithms fit quite naturally into the MapReduce program-
ming model. Although the model being optimized determines the details of the required
computations, MapReduce implementations of EM algorithms share a number of char-
acteristics:
• Each iteration of EM is one MapReduce job.
• A controlling process (i.e., driver program) spawns the MapReduce jobs and keeps
track of the number of iterations and of the convergence criteria (a sketch of such a
driver loop appears after this list).
• Model parameters θ^(i), which are static for the duration of the MapReduce job,
are loaded by each mapper from HDFS or another data provider (e.g., a distributed
key-value store).
• Mappers map over independent training instances, computing partial latent vari-
able posteriors (or summary statistics, such as expected counts).
• Reducers sum together the required training statistics and solve one or more of
the M-step optimization problems.
• Combiners, which sum together the training statistics, are often quite effective at
reducing the amount of data that must be written to disk.
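A driver for this iterative process might look like the following Python sketch. Here run_em_iteration is a hypothetical helper that configures and submits one MapReduce job for iteration i and blocks until it finishes; monitoring the change in training log-likelihood is used purely for illustration, as convergence can be assessed in several ways.

def train(max_iterations=50, tolerance=1e-4):
    """Driver loop: one MapReduce job per EM iteration (hypothetical helpers)."""
    prev_ll = float('-inf')
    for i in range(max_iterations):
        # Hypothetical: submits a job whose mappers read the parameters for
        # iteration i and whose reducers write the parameters for iteration
        # i + 1; returns an aggregate training log-likelihood under theta^(i).
        log_likelihood = run_em_iteration(i)
        if abs(log_likelihood - prev_ll) < tolerance:
            break
        prev_ll = log_likelihood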
The degree of parallelization that can be attained depends on the statistical indepen-
dence assumed in the model and in the derived quantities required to solve the opti-
mization problems in the M-step. Since parameters are estimated from a collection of
samples that are assumed to be i.i.d., the E-step can generally be parallelized effectively
since every training instance can be processed independently of the others. In the limit,
in fact, each independent training instance could be processed by a separate mapper!
10
10
Although the wisdom of doing this is questionable, given that the startup costs associated with individual map
tasks in Hadoop may be considerable.
Reducers, however, must aggregate the statistics necessary to solve the optimiza-
tion problems as required by the model. The degree to which these may be solved
independently depends on the structure of the model, and this constrains the number
of reducers that may be used. Fortunately, many common models (such as HMMs) re-
quire solving several independent optimization problems in the M-step. In this situation,
a number of reducers may be run in parallel. Still, it is possible that in the worst case,
the M-step optimization problem will not decompose into independent subproblems,
making it necessary to use a single reducer.
6.3.1 HMM TRAINING IN MAPREDUCE
As we would expect, the training of hidden Markov models parallelizes well in Map-
Reduce. The process can be summarized as follows: in each iteration, mappers pro-
cess training instances, emitting expected event counts computed using the forward-
backward algorithm introduced in Section 6.2.4. Reducers aggregate the expected
counts, completing the E-step, and then generate parameter estimates for the next
iteration using the updates given in Equation 6.3.
This parallelization strategy is effective for several reasons. First, the majority of
the computational effort in HMM training is the running of the forward and backward
algorithms. Since there is no limit on the number of mappers that may be run, the
full computational resources of a cluster may be brought to bear to solve this problem.
Second, since the M-step of an HMM training iteration with |S| states in the model
consists of 2 |S| + 1 independent optimization problems that require non-overlapping
sets of statistics, this may be exploited with as many as 2 |S| + 1 reducers running
in parallel. While the optimization problem is computationally trivial, being able to
reduce in parallel helps avoid the data bottleneck that would limit performance if only
a single reducer is used.
The quantities that are required to solve the M-step optimization problem are
quite similar to the relative frequency estimation example discussed in Section 3.3;
however, rather than counts of observed events, we aggregate expected counts of events.
As a result of the similarity, we can employ the stripes representation for aggregating
sets of related values, as described in Section 3.2. A pairs approach that requires less
memory at the cost of slower performance is also feasible.
HMM training mapper. The pseudo-code for the HMM training mapper is given
in Figure 6.8. The input consists of key-value pairs with a unique id as the key and
a training instance (e.g., a sentence) as the value. For each training instance, 2n + 1
stripes are emitted with unique keys, and every training instance emits the same set
of keys. Each unique key corresponds to one of the independent optimization problems
that will be solved in the M-step. The outputs are:
1: class Mapper
2:   method Initialize(integer iteration)
3:     ⟨S, O⟩ ← ReadModel
4:     θ ← ⟨A, B, π⟩ ← ReadModelParams(iteration)
5:   method Map(sample id, sequence x)
6:     α ← Forward(x, θ)                                ▷ cf. Section 6.2.2
7:     β ← Backward(x, θ)                               ▷ cf. Section 6.2.4
8:     I ← new AssociativeArray                         ▷ Initial state expectations
9:     for all q ∈ S do                                 ▷ Loop over states
10:      I{q} ← α_1(q) · β_1(q)
11:    O ← new AssociativeArray of AssociativeArray     ▷ Emissions
12:    for t = 1 to |x| do                              ▷ Loop over observations
13:      for all q ∈ S do                               ▷ Loop over states
14:        O{q}{x_t} ← O{q}{x_t} + α_t(q) · β_t(q)
15:      t ← t + 1
16:    T ← new AssociativeArray of AssociativeArray     ▷ Transitions
17:    for t = 1 to |x| − 1 do                          ▷ Loop over observations
18:      for all q ∈ S do                               ▷ Loop over states
19:        for all r ∈ S do                             ▷ Loop over states
20:          T{q}{r} ← T{q}{r} + α_t(q) · A_q(r) · B_r(x_{t+1}) · β_{t+1}(r)
21:      t ← t + 1
22:    Emit(string ‘initial ’, stripe I)
23:    for all q ∈ S do                                 ▷ Loop over states
24:      Emit(string ‘emit from ’ + q, stripe O{q})
25:      Emit(string ‘transit from ’ + q, stripe T{q})
Figure 6.8: Mapper pseudo-code for training hidden Markov models using EM. The mappers
map over training instances (i.e., sequences of observations x^i) and generate the expected counts
of initial states, emissions, and transitions taken to generate the sequence.
1: class Combiner
2:   method Combine(string t, stripes [C_1, C_2, . . .])
3:     C_f ← new AssociativeArray
4:     for all stripe C ∈ stripes [C_1, C_2, . . .] do
5:       Sum(C_f, C)
6:     Emit(string t, stripe C_f)

1: class Reducer
2:   method Reduce(string t, stripes [C_1, C_2, . . .])
3:     C_f ← new AssociativeArray
4:     for all stripe C ∈ stripes [C_1, C_2, . . .] do
5:       Sum(C_f, C)
6:     z ← 0
7:     for all ⟨k, v⟩ ∈ C_f do
8:       z ← z + v
9:     P_f ← new AssociativeArray                        ▷ Final parameters vector
10:    for all ⟨k, v⟩ ∈ C_f do
11:      P_f{k} ← v / z
12:    Emit(string t, stripe P_f)
Figure 6.9: Combiner and reducer pseudo-code for training hidden Markov models using EM.
The HMMs considered in this book are fully parameterized by multinomial distributions, so
reducers do not require special logic to handle different types of model parameters (since they
are all of the same type).
1. the expected counts of the unobserved Markov process beginning in each state q, with
a unique key designating that the values are initial state counts;
2. the expected number of times that state q generated each emission symbol o (the
set of emission symbols included will be just those found in each training instance
x), with a key indicating that the associated value is a set of emission counts from
state q; and
3. the expected number of times state q transitions to each state r, with a key indi-
cating that the associated value is a set of transition counts from state q.
HMM training reducer. The reducer for one iteration of HMM training, shown
together with an optional combiner in Figure 6.9, aggregates the count collections as-
sociated with each key by summing them. When the values for each key have been
completely aggregated, the associative array contains all of the statistics necessary to
compute a subset of the parameters for the next EM iteration. The optimal parameter
settings for the following iteration are computed simply by computing the relative fre-
quency of each event with respect to its expected count at the current iteration. The
new computed parameters are emitted from the reducer and written to HDFS. Note
that they will be spread across 2 |S| + 1 keys, representing initial state probabilities
π, transition probabilities A_q for each state q, and emission probabilities B_q for each
state q.
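Concretely, the reduce logic of Figure 6.9 amounts to summing associative arrays and renormalizing; a minimal Python rendering, assuming each stripe arrives as a dictionary mapping events to partial expected counts, might look as follows.

def reduce_stripes(key, stripes):
    """Sum partial count stripes for one key, then normalize to a distribution."""
    totals = {}
    for stripe in stripes:                       # each stripe: {event: expected count}
        for event, count in stripe.items():
            totals[event] = totals.get(event, 0.0) + count
    z = sum(totals.values())
    params = {event: count / z for event, count in totals.items()}
    return key, params

# reduce_stripes('emit from V', [{'might': 0.4, 'wash': 0.2}, {'wash': 0.4}])
# -> ('emit from V', {'might': 0.4, 'wash': 0.6})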
6.4 CASE STUDY: WORD ALIGNMENT FOR STATISTICAL
MACHINE TRANSLATION
To illustrate the real-world benefits of expectation maximization algorithms using Map-
Reduce, we turn to the problem of word alignment, which is an important task in sta-
tistical machine translation that is typically solved using models whose parameters are
learned with EM.
We begin by giving a brief introduction to statistical machine translation and
the phrase-based translation approach; for a more comprehensive introduction, refer to
[85, 97]. Fully-automated translation has been studied since the earliest days of elec-
tronic computers. After successes with code-breaking during World War II, there was
considerable optimism that translation of human languages would be another soluble
problem. In the early years, work on translation was dominated by manual attempts to
encode linguistic knowledge into computers—another instance of the ‘rule-based’ ap-
proach we described in the introduction to this chapter. These early attempts failed to
live up to the admittedly optimistic expectations. For a number of years, the idea of
fully automated translation was viewed with skepticism. Not only was constructing a
translation system labor intensive, but translation pairs had to be developed indepen-
dently, meaning that improvements in a Russian-English translation system could not,
for the most part, be leveraged to improve a French-English system.
After languishing for a number of years, the field was reinvigorated in the late
1980s when researchers at IBM pioneered the development of statistical machine trans-
lation (SMT), which took a data-driven approach to solving the problem of machine
translation, attempting to improve both the quality of translation while reducing the
cost of developing systems [29]. The core idea of SMT is to equip the computer to learn
how to translate, using example translations which are produced for other purposes, and
modeling the process as a statistical process with some parameters θ relating strings
in a source language (typically denoted as f) to strings in a target language (typically
denoted as e):
e* = arg max_e Pr(e | f; θ)
With the statistical approach, translation systems can be developed cheaply and
quickly for any language pair, as long as there is sufficient training data available.
Furthermore, improvements in learning algorithms and statistical modeling can yield
benefits in many translation pairs at once, rather than being specific to individual
language pairs. Thus, SMT, like many other topics we are considering in this book,
is an attempt to leverage the vast quantities of textual data that is available to solve
problems that would otherwise require considerable manual effort to encode specialized
knowledge. Since the advent of statistical approaches to translation, the field has grown
tremendously and numerous statistical models of translation have been developed, with
many incorporating quite specialized knowledge about the behavior of natural language
as biases in their learning algorithms.
6.4.1 STATISTICAL PHRASE-BASED TRANSLATION
One approach to statistical translation that is simple yet powerful is called phrase-based
translation [86]. We provide a rough outline of the process since it is representative of
most state-of-the-art statistical translation systems, such as the one used inside Google
Translate.
11
Phrase-based translation works by learning how strings of words, called
phrases, translate between languages.
12
Example phrase pairs for Spanish-English trans-
lation might include ¸los estudiantes, the students), ¸los estudiantes, some students),
and ¸soy, i am). From a few hundred thousand sentences of example translations, many
millions of such phrase pairs may be automatically learned.
The starting point is typically a parallel corpus (also called bitext), which contains
pairs of sentences in two languages that are translations of each other. Parallel corpora
11
http://translate.google.com
12
Phrases are simply sequences of words; they are not required to correspond to the definition of a phrase in any
linguistic theory.
are frequently generated as the byproduct of an organization’s effort to disseminate in-
formation in multiple languages, for example, proceedings of the Canadian Parliament
in French and English, and text generated by the United Nations in many different
languages. The parallel corpus is then annotated with word alignments, which indicate
which words in one language correspond to words in the other. By using these word
alignments as a skeleton, phrases can be extracted from the sentence that is likely to
preserve the meaning relationships represented by the word alignment. While an expla-
nation of the process is not necessary here, we mention it as a motivation for learning
word alignments, which we show below how to compute with EM. After phrase ex-
traction, each phrase pair is associated with a number of scores which, taken together,
are used to compute the phrase translation probability, a conditional probability that
reflects how likely the source phrase translates into the target phrase. We briefly note
that although EM could be utilized to learn the phrase translation probabilities, this
is not typically done in practice since the maximum likelihood solution turns out to be
quite bad for this problem. The collection of phrase pairs and their scores are referred to
as the translation model. In addition to the translation model, phrase-based translation
depends on a language model, which gives the probability of a string in the target lan-
guage. The translation model attempts to preserve the meaning of the source language
during the translation process, while the language model ensures that the output is
fluent and grammatical in the target language. The phrase-based translation process is
summarized in Figure 6.10.
A language model gives the probability that a string of words w =
⟨w_1, w_2, . . . , w_n⟩, written as w_1^n for short, is a string in the target language. By the
chain rule of probability, we get:
Pr(w_1^n) = Pr(w_1) Pr(w_2 | w_1) Pr(w_3 | w_1^2) · · · Pr(w_n | w_1^{n−1}) = Π_{k=1}^{n} Pr(w_k | w_1^{k−1})        (6.8)
Due to the extremely large number of parameters involved in estimating such a model
directly, it is customary to make the Markov assumption, that the sequence histories
only depend on prior local context. That is, an n-gram language model is equivalent to
an (n − 1)th-order Markov model. Thus, we can approximate P(w_k | w_1^{k−1}) as follows:
bigrams: P(w_k | w_1^{k−1}) ≈ P(w_k | w_{k−1})        (6.9)
trigrams: P(w_k | w_1^{k−1}) ≈ P(w_k | w_{k−1} w_{k−2})        (6.10)
n-grams: P(w_k | w_1^{k−1}) ≈ P(w_k | w_{k−n+1}^{k−1})        (6.11)
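Under the bigram approximation, for example, the conditional probabilities can be estimated by simple relative frequencies over a monolingual corpus. The Python sketch below shows the unsmoothed (maximum likelihood) version; as discussed in the next subsection, real systems apply smoothing to cope with unseen n-grams.

from collections import defaultdict

def bigram_mle(sentences):
    """Unsmoothed bigram model: P(w_k | w_{k-1}) by relative frequency."""
    unigram = defaultdict(float)
    bigram = defaultdict(lambda: defaultdict(float))
    for sentence in sentences:
        words = ['<s>'] + sentence + ['</s>']     # sentence boundary markers
        for prev, word in zip(words, words[1:]):
            unigram[prev] += 1
            bigram[prev][word] += 1
    return {prev: {w: c / unigram[prev] for w, c in nexts.items()}
            for prev, nexts in bigram.items()}

# p = bigram_mle([['the', 'service', 'was', 'good']])
# p['the']['service'] == 1.0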
The probabilities used in computing Pr(w_1^n) based on an n-gram language model are
generally estimated from a monolingual corpus of target language text. Since only target
language text is necessary (without any additional annotation), language modeling has
been well served by large-data approaches that take advantage of the vast quantities of
text available on the web.

Figure 6.10: The standard phrase-based machine translation architecture. The translation
model is constructed with phrases extracted from a word-aligned parallel corpus. The language
model is estimated from a monolingual corpus. Both serve as input to the decoder, which
performs the actual translation.

Figure 6.11: Translation coverage of the sentence Maria no dio una bofetada a la bruja verde
by a phrase-based model. The best possible translation path is indicated with a dashed line.
To translate an input sentence f, the phrase-based decoder creates a matrix of all
translation possibilities of all substrings in the input string, as illustrated by the example in
Figure 6.11. A sequence of phrase pairs is selected such that each word in f is translated
exactly once.
13
The decoder seeks to find the translation that maximizes the product of
the translation probabilities of the phrases used and the language model probability of
the resulting string in the target language. Because the phrase translation probabilities
are independent of each other and because of the Markov assumption made in the language
model, this may be done efficiently using dynamic programming. For a detailed introduction
to phrase-based decoding, we refer the reader to a recent textbook by Koehn [85].
6.4.2 BRIEF DIGRESSION: LANGUAGE MODELING WITH MAPREDUCE
Statistical machine translation provides the context for a brief digression on distributed
parameter estimation for language models using MapReduce, and provides another
example illustrating the effectiveness of data-driven approaches in general. We briefly
touched upon this work in Chapter 1. Even after making the Markov assumption, train-
ing n-gram language models still requires estimating an enormous number of parameters:
potentially V^n, where V is the number of words in the vocabulary. For higher-order
models (e.g., 5-grams) used in real-world applications, the number of parameters can
easily exceed the number of words from which to estimate those parameters. In fact,
most n-grams will never be observed in a corpus, no matter how large. To cope with this
sparseness, researchers have developed a number of smoothing techniques [102], which
all share the basic idea of moving probability mass from observed to unseen events in
a principled manner. For many applications, a state-of-the-art approach is known as
Kneser-Ney smoothing [35].
In 2007, Brants et al. [25] reported experimental results that answered an inter-
esting question: given the availability of large corpora (i.e., the web), could a simpler
smoothing strategy, applied to more text, beat Kneser-Ney in a machine translation
task? It should come as no surprise that the answer is yes. Brants et al. introduced a
technique known as “stupid backoff” that was exceedingly simple and so naïve that the
resulting model didn’t even define a valid probability distribution (it assigned arbitrary
scores as opposed to probabilities). The simplicity, however, afforded an extremely scalable
implementation in MapReduce. With smaller corpora, stupid backoff didn’t work
as well as Kneser-Ney in generating accurate and fluent translations. However, as the
amount of data increased, the gap between stupid backoff and Kneser-Ney narrowed,
13
The phrases may not necessarily be selected in a strict left-to-right order. Being able to vary the order of the
phrases used is necessary since languages may express the same ideas using different word orders.
and eventually disappeared with sufficient data. Furthermore, with stupid backoff it
was possible to train a language model on more data than was feasible with Kneser-
Ney smoothing. Applying this language model to a machine translation task yielded
better results than a (smaller) language model trained with Kneser-Ney smoothing.
The role of the language model in statistical machine translation is to select
fluent, grammatical translations from a large hypothesis space: the more training data a
language model has access to, the better its description of relevant language phenomena
and hence its ability to select good translations. Once again, large data triumphs! For
more information about estimating language models using MapReduce, we refer the
reader to a forthcoming book from Morgan & Claypool [26].
6.4.3 WORD ALIGNMENT
Word alignments, which are necessary for building phrase-based translation models (as
well as many other more sophisticated translation models), can be learned automatically
using EM. In this section, we introduce a popular alignment model based on HMMs.
In the statistical model of word alignment considered here, the observable variables
are the words in the source and target sentences (conventionally written using the
variables f and e, respectively), and their alignment is a latent variable. To make this
model tractable, we assume that words are translated independently of one another,
which means that the model’s parameters include the probability of any word in the
source language translating to any word in the target language. While this independence
assumption is problematic in many ways, it results in a simple model structure that
admits efficient inference yet produces reasonable alignments. Alignment models that
make this assumption generate a string e in the target language by selecting words in
the source language according to a lexical translation distribution. The indices of the
words in f used to generate each word in e are stored in an alignment variable, a.
14
This means that the variable a
i
indicates the source word position of the i
th
target word
generated, and [a[ = [e[. Using these assumptions, the probability of an alignment and
translation can be written as follows:
Pr(e, a[f) = Pr(a[f, e)
. ¸¸ .
Alignment probability

]e]

i=1
Pr(e
i
[f
ai
)
. ¸¸ .
Lexical probability
Since we have parallel corpora consisting of only ⟨f, e⟩ pairs, we can learn the parameters
for this model using EM, treating a as a latent variable. However, to combat
14
In the original presentation of statistical lexical translation models, a special null word is added to the source
sentences, which permits words to be inserted ‘out of nowhere’. Since this does not change any of the important
details of training, we omit it from our presentation for simplicity.
data sparsity in the alignment probability, we must make some further simplifying as-
sumptions. By letting the probability of an alignment depend only on the position of
the previous aligned word we capture a valuable insight (namely, words that are nearby
in the source language will tend to be nearby in the target language), and our model
acquires the structure of an HMM [150]:
Pr(e, a | f) = Π_{i=1}^{|e|} Pr(a_i | a_{i−1}) · Π_{i=1}^{|e|} Pr(e_i | f_{a_i})

where the first product gives the transition probabilities and the second the emission
probabilities.
This model can be trained using the forward-backward algorithm described in the pre-
vious section, summing over all settings of a, and the best alignment for a sentence pair
can be found using the Viterbi algorithm.
To properly initialize this HMM, it is conventional to further simplify the align-
ment probability model, and use this simpler model to learn initial lexical translation
(emission) parameters for the HMM. The favored simplification is to assert that all
alignments are uniformly probable:
Pr(e, a | f) = (1 / |f|^{|e|}) · Π_{i=1}^{|e|} Pr(e_i | f_{a_i})
This model is known as IBM Model 1. It is attractive for initialization because it is
convex everywhere, and therefore EM will learn the same solution regardless of initial-
ization. Finally, while the forward-backward algorithm could be used to compute the
expected counts necessary for training this model by setting A_q(r) to be a constant
value for all q and r, the uniformity assumption means that the expected emission
counts can be estimated in time O(|e| · |f|), rather than the time O(|e| · |f|²) required by
the forward-backward algorithm.
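To see why the uniformity assumption buys this speedup, note that under Model 1 the posterior probability that target word e_i was generated by source word f_j depends only on the lexical probabilities, so the expected counts for one sentence pair can be accumulated in a single pass over all (i, j) pairs. The following Python sketch of this E-step is an illustrative rendering (not the authors' implementation); t is assumed to hold the current lexical translation table as a nested dictionary, and, as in the presentation above, the special null word is omitted.

def model1_expected_counts(f, e, t):
    """Expected lexical counts for one sentence pair under IBM Model 1.
    f, e: lists of source and target words; t[fj][ei]: current Pr(ei | fj)."""
    counts = {}
    for ei in e:
        # Posterior over which source word generated ei; alignments are uniform,
        # so the normalizer is just the sum of lexical probabilities.
        z = sum(t[fj][ei] for fj in f)
        for fj in f:
            counts.setdefault(fj, {})
            counts[fj][ei] = counts[fj].get(ei, 0.0) + t[fj][ei] / z
    return counts    # summed over sentence pairs, then normalized in the M-step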
6.4.4 EXPERIMENTS
How well does a MapReduce word aligner for statistical machine translation perform?
We describe previously-published results [54] that compared a Java-based Hadoop im-
plementation against a highly optimized word aligner called Giza++ [112], which was
written in C++ and designed to run efficiently on a single core. We compared the train-
ing time of Giza++ and our aligner on a Hadoop cluster with 19 slave nodes, each with
two single-core processors and two disks (38 cores total).
Figure 6.12 shows the performance of Giza++ in terms of the running time of a
single EM iteration for both Model 1 and the HMM alignment model as a function of
the number of training pairs. Both axes in the figure are on a log scale, but the ticks
on the y-axis are aligned with ‘meaningful’ time intervals rather than exact orders of
magnitude. There are three things to note. First, the running time scales linearly with
the size of the training data. Second, the HMM is a constant factor slower than Model 1.
Third, the alignment process is quite slow as the size of the training data grows—at one
million sentences, a single iteration takes over three hours to complete! Five iterations
are generally necessary to train the models, which means that full training takes the
better part of a day.
In Figure 6.13 we plot the running time of our MapReduce implementation run-
ning on the 38-core cluster described above. For reference, we plot points indicating
what 1/38 of the running time of the Giza++ iterations would be at each data size,
which gives a rough indication of what an ‘ideal’ parallelization could achieve, assum-
ing that there was no overhead associated with distributing computation across these
machines. Three things may be observed in the results. First, as the amount of data
increases, the relative cost of the overhead associated with distributing data, marshal-
ing and aggregating counts, decreases. At one million sentence pairs of training data,
the HMM alignment iterations begin to approach optimal runtime efficiency. Second,
Model 1, which we observe is light on computation, does not approach the theoretical
performance of an ideal parallelization, and in fact, has almost the same running time
as the HMM alignment algorithm. We conclude that the overhead associated with dis-
tributing and aggregating data is significant compared to the Model 1 computations,
although a comparison with Figure 6.12 indicates that the MapReduce implementation
is still substantially faster than the single core implementation, at least once a certain
training data size is reached. Finally, we note that, in comparison to the running times
of the single-core implementation, at large data sizes, there is a significant advantage
to using the distributed implementation, even of Model 1.
Although these results do confound several variables (Java vs. C++ performance,
memory usage patterns), it is reasonable to expect that the confounds would tend to
make the single-core system’s performance appear relatively better than the MapReduce
system (which is, of course, the opposite pattern from what we actually observe). Fur-
thermore, these results show that when computation is distributed over a cluster of
many machines, even an unsophisticated implementation of the HMM aligner could
compete favorably with a highly optimized single-core system whose performance is
well-known to many people in the MT research community.
Why are these results important? Perhaps the most significant reason is that
the quantity of parallel data that is available to train statistical machine translation
models is ever increasing, and as is the case with so many problems we have encountered,
more data leads to improvements in translation quality [54]. Recently a corpus of one
billion words of French-English data was mined automatically from the web and released
Figure 6.12: Running times of Giza++ (baseline single-core system) for Model 1 and HMM
training iterations at various corpus sizes. [Plot: average iteration latency (seconds) versus
corpus size (sentences), both axes on log scales, with one curve each for Model 1 and the HMM.]
Figure 6.13: Running times of our MapReduce implementation of Model 1 and HMM training
iterations at various corpus sizes. For reference, 1/38 running times of the Giza++ models are
shown. [Plot: time (seconds) versus corpus size (sentences), with curves for Optimal Model 1
(Giza/38), Optimal HMM (Giza/38), MapReduce Model 1 (38 M/R), and MapReduce HMM
(38 M/R).]
publicly [33].¹⁵ Single-core solutions to model construction simply cannot keep pace with
the amount of translated data that is constantly being produced. Fortunately, several
independent researchers have shown that existing modeling algorithms can be expressed
naturally and effectively using MapReduce, which means that we can take advantage
of this data. Furthermore, the results presented here show that even at data sizes
that may be tractable on single machines, significant performance improvements are
attainable using MapReduce implementations. This improvement reduces experimental
turnaround times, which allows researchers to more quickly explore the solution space—
which will, we hope, lead to rapid new developments in statistical machine translation.
For the reader interested in statistical machine translation, there is an open source
Hadoop-based MapReduce implementation of a training pipeline for phrase-based trans-
lation that includes word alignment, phrase extraction, and phrase scoring [56].
6.5 EM-LIKE ALGORITHMS
This chapter has focused on expectation maximization algorithms and their implemen-
tation in the MapReduce programming framework. These important algorithms are
indispensable for learning models with latent structure from unannotated data, and
they can be implemented quite naturally in MapReduce. We now explore some related
learning algorithms that are similar to EM but can be used to solve more general
problems, and discuss their implementation.
In this section we focus on gradient-based optimization, which refers to a class of
techniques used to optimize any objective function, provided it is differentiable with
respect to the parameters being optimized. Gradient-based optimization is particularly
useful in the learning of maximum entropy (maxent) models [110] and conditional ran-
dom fields (CRF) [87] that have an exponential form and are trained to maximize
conditional likelihood. These models are widely used for supervised classification in text
processing (meaning that during training, both the data and their annotations must be
observable), and their gradients take the form of expectations. As a result, some of
the previously-introduced techniques are also applicable for optimizing these models.
6.5.1 GRADIENT-BASED OPTIMIZATION AND LOG-LINEAR MODELS
Gradient-based optimization refers to a class of iterative optimization algorithms that
use the derivatives of a function to find the parameters that yield a minimal or maximal
value of that function. Obviously, these algorithms are only applicable in cases where a
useful objective exists, is differentiable, and its derivatives can be efficiently evaluated.
Fortunately, this is the case for many important problems of interest in text process-
ing. For the purposes of this discussion, we will give examples in terms of minimizing
functions.
¹⁵ http://www.statmt.org/wmt10/translation-task.html
Assume that we have some real-valued function F(θ) where θ is a k-dimensional
vector and that F is differentiable with respect to θ. Its gradient is defined as:
\[
\nabla F(\theta) = \left\langle \frac{\partial F}{\partial \theta_1}(\theta), \frac{\partial F}{\partial \theta_2}(\theta), \ldots, \frac{\partial F}{\partial \theta_k}(\theta) \right\rangle
\]
The gradient has two crucial properties that are exploited in gradient-based optimiza-
tion. First, the gradient ∇F is a vector field that points in the direction of the greatest
increase of F and whose magnitude indicates the rate of increase. Second, if θ* is a
(local) minimum of F, then the following is true:

\[
\nabla F(\theta^*) = 0
\]
An extremely simple gradient-based minimization algorithm produces a series of
parameter estimates θ^(1), θ^(2), . . . by starting with some initial parameter settings θ^(1)
and updating parameters through successive iterations according to the following rule:

\[
\theta^{(i+1)} = \theta^{(i)} - \eta^{(i)} \nabla F(\theta^{(i)}) \tag{6.12}
\]
The parameter η^(i) > 0 is a learning rate which indicates how quickly the algorithm
moves along the gradient during iteration i. Provided this value is small enough that
F decreases, this strategy will find a local minimum of F. However, while simple, this
update strategy may converge slowly, and proper selection of η is non-trivial. More
sophisticated algorithms perform updates that are informed by approximations of the
second derivative, which are estimated by successive evaluations of ∇F(θ), and can
converge much more rapidly [96].
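As a concrete, if toy, illustration of Equation 6.12, the following sketch (ours, not the book's) minimizes a simple quadratic with a fixed learning rate; in a learning setting, F and its gradient would of course be the training objective and its derivatives.

```python
def gradient_descent(grad, theta, eta=0.1, iterations=100):
    """Minimal sketch of the update rule in Equation 6.12 with a constant
    learning rate eta; grad(theta) returns the gradient of F at theta."""
    for _ in range(iterations):
        g = grad(theta)
        theta = [t - eta * g_k for t, g_k in zip(theta, g)]
    return theta

# Example: F(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2 has gradient
# (2(theta_1 - 3), 2(theta_2 + 1)) and its minimum at (3, -1).
theta_star = gradient_descent(lambda th: [2 * (th[0] - 3), 2 * (th[1] + 1)],
                              [0.0, 0.0])
```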
Gradient-based optimization in MapReduce. Gradient-based optimization al-
gorithms can often be implemented effectively in MapReduce. Like EM, where the
structure of the model determines the specifics of the realization, the details of the
function being optimized determine how it should best be implemented, and not ev-
ery function optimization problem will be a good fit for MapReduce. Nevertheless,
MapReduce implementations of gradient-based optimization tend to have the following
characteristics:
• Each optimization iteration is one MapReduce job.
• The objective should decompose linearly across training instances. This implies
that the gradient also decomposes linearly, and therefore mappers can process
input data in parallel. The values they emit are pairs ⟨F(θ), ∇F(θ)⟩, which are
linear components of the objective and gradient.
• Evaluation of the function and its gradient is often computationally expensive
because they require processing lots of data. This makes parallelization with Map-
Reduce worthwhile.
• Whether more than one reducer can run in parallel depends on the specific opti-
mization algorithm being used. Some, like the trivial algorithm of Equation 6.12,
treat the dimensions of θ independently, whereas many are sensitive to global
properties of ∇F(θ). In the latter case, parallelization across multiple reducers is
non-trivial.
• Reducer(s) sum the component objective/gradient pairs, compute the total objec-
tive and gradient, run the optimization algorithm, and emit θ^(i+1). (A minimal
sketch of this division of labor appears after this list.)

• Many optimization algorithms are stateful and must persist their state between
optimization iterations. This may either be emitted together with θ^(i+1) or written
to the distributed file system as a side effect of the reducer. Such external side
effects must be handled carefully; refer to Section 2.2 for a discussion.
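To make this division of labor concrete, here is a minimal, self-contained sketch in which plain Python functions stand in for the mappers and a single reducer; the toy squared-error objective, the function names, and the driver loop are our illustrative assumptions, not Hadoop code from this book.

```python
def mapper(instance, theta):
    """Emit the per-instance contribution to the objective and gradient. Here F
    is squared error for a toy linear model y ~ theta . x; in a real job this
    would be, e.g., the negative log-likelihood of the instance."""
    x, y = instance
    err = sum(t * xi for t, xi in zip(theta, x)) - y
    return 0.5 * err * err, [err * xi for xi in x]

def reducer(partials, theta, eta=0.1):
    """Sum the linear components and apply the update of Equation 6.12."""
    total_obj = sum(obj for obj, _ in partials)
    total_grad = [sum(g[k] for _, g in partials) for k in range(len(theta))]
    return total_obj, [t - eta * g for t, g in zip(theta, total_grad)]

# One optimization iteration corresponds to one (simulated) MapReduce job.
data = [([1.0, 1.0], 2.0), ([1.0, 2.0], 3.0), ([1.0, 3.0], 4.0)]
theta = [0.0, 0.0]
for _ in range(200):
    partials = [mapper(d, theta) for d in data]   # map phase, parallelizable
    objective, theta = reducer(partials, theta)   # reduce phase
```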
Parameter learning for log-linear models. Gradient-based optimization tech-
niques can be quite effectively used to learn the parameters of probabilistic models with
a log-linear parameterization [100]. While a comprehensive introduction to these models
is beyond the scope of this book, such models are used extensively in text processing
applications, and their training using gradient-based optimization, which may otherwise
be computationally expensive, can be implemented effectively using MapReduce. We
therefore include a brief summary.
Log-linear models are particularly useful for supervised learning (unlike the un-
supervised models learned with EM), where an annotation y ∈ 𝒴 is available for every
x ∈ 𝒳 in the training data. In this case, it is possible to directly model the conditional
distribution of label given input:
\[
\Pr(y \mid x; \theta) = \frac{\exp \sum_i \theta_i H_i(x, y)}{\sum_{y'} \exp \sum_i \theta_i H_i(x, y')}
\]
In this expression, H_i are real-valued functions sensitive to features of the input and
labeling. The parameters of the model are selected so as to minimize the negative condi-
tional log likelihood of a set of training instances ⟨⟨x, y⟩_1, ⟨x, y⟩_2, . . .⟩, which we assume
to be i.i.d.:

\[
F(\theta) = \sum_{\langle x, y \rangle} -\log \Pr(y \mid x; \theta) \tag{6.13}
\]
\[
\theta^* = \arg\min_{\theta} F(\theta) \tag{6.14}
\]
As Equation 6.13 makes clear, the objective decomposes linearly across training in-
stances, meaning it can be optimized quite well in MapReduce. The partial derivative of
F with respect to θ_i can be shown to have the following form [141]:¹⁶

\[
\frac{\partial F}{\partial \theta_i}(\theta) = -\sum_{\langle x, y \rangle} \Bigl( H_i(x, y) - \mathrm{E}_{\Pr(y' \mid x;\, \theta)}\bigl[ H_i(x, y') \bigr] \Bigr)
\]
The expectation in the second part of the gradient’s expression can be computed using a
variety of techniques. However, as we saw with EM, when very large event spaces are be-
ing modeled, as is the case with sequence labeling, enumerating all possible values y can
become computationally intractable. And, as was the case with HMMs, independence
assumptions can be used to enable efficient computation using dynamic programming.
In fact, the forward-backward algorithm introduced in Section 6.2.4 can, with only min-
imal modification, be used to compute the expectation E_{Pr(y′|x;θ)}[H_i(x, y′)] needed in
CRF sequence models, as long as the feature functions respect the same Markov as-
sumption that is made in HMMs. For more information about inference in CRFs using
the forward-backward algorithm, we refer the reader to Sha and Pereira [140].
As we saw in the previous section, MapReduce offers significant speedups when
training iterations require running the forward-backward algorithm. The same pattern
of results holds when training linear CRFs.
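To ground this discussion, the following sketch shows the conditional distribution and the per-instance gradient components for a small label set rather than a sequence model; the function names and the feature interface H(x, y), which returns a list of feature values, are illustrative assumptions of ours. Summed over the training data (for example, in a reducer), the per-instance vectors yield ∇F(θ) for use in the update of Equation 6.12.

```python
import math

def p_y_given_x(x, theta, labels, H):
    """Conditional distribution of a log-linear model: Pr(y | x; theta) is
    proportional to exp(sum_i theta_i * H_i(x, y))."""
    scores = {y: math.exp(sum(t * h for t, h in zip(theta, H(x, y))))
              for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def instance_gradient(x, y, theta, labels, H):
    """Per-instance contribution to the gradient of the negative conditional
    log-likelihood: the model expectation of each feature minus its observed
    value (small label sets only; CRFs use forward-backward instead)."""
    p = p_y_given_x(x, theta, labels, H)
    observed = H(x, y)
    expected = [sum(p[yp] * H(x, yp)[i] for yp in labels)
                for i in range(len(theta))]
    return [ex - ob for ob, ex in zip(observed, expected)]
```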
6.6 SUMMARY AND ADDITIONAL READINGS
This chapter focused on learning the parameters of statistical models from data, using
expectation maximization algorithms or gradient-based optimization techniques. We
focused especially on EM algorithms for three reasons. First, these algorithms can be
expressed naturally in the MapReduce programming model, making them a good exam-
ple of how to express a commonly-used algorithm in this new framework. Second, many
models, such as the widely-used hidden Markov model (HMM) trained using EM, make
independence assumptions that permit a high degree of parallelism in both the E- and
M-steps. Thus, they are particularly well-positioned to take advantage of large clusters.
Finally, EM algorithms are unsupervised learning algorithms, which means that they
have access to far more training data than comparable supervised approaches. This is
quite important. In Chapter 1, when we hailed large data as the “rising tide that lifts
all boats” to yield more effective algorithms, we were mostly referring to unsupervised
approaches, given that the manual effort required to generate annotated data remains
a bottleneck in many supervised approaches. Data acquisition for unsupervised algo-
rithms is often as simple as crawling specific web sources, given the enormous quantities
of data available “for free”. This, combined with the ability of MapReduce to process
large datasets in parallel, provides researchers with an effective strategy for developing
increasingly effective applications.

¹⁶ This assumes that when ⟨x, y⟩ is present the model is fully observed (i.e., there are no additional latent
variables).
Since EM algorithms are relatively computationally expensive, even for small
amounts of data, we were led to consider how related supervised learning models (which
typically have much less training data available) can also be implemented in Map-
Reduce. The discussion demonstrates that not only does MapReduce provide a means
for coping with ever-increasing amounts of data, but it is also useful for parallelizing
expensive computations. Although MapReduce has been designed with mostly data-
intensive applications in mind, the ability to leverage clusters of commodity hardware
to parallelize computationally-expensive algorithms is an important use case.
Additional Readings. Because of its ability to leverage large amounts of training
data, machine learning is an attractive problem for MapReduce and an area of active
research. Chu et al. [37] presented general formulations of a variety of machine learning
problems, focusing on a normal form for expressing such algorithms in MapReduce.
The Apache Mahout project is an open-source implementation
of these and other learning algorithms,¹⁷ and it is also the subject of a forthcoming
book [116]. Issues associated with a MapReduce implementation of latent Dirichlet
allocation (LDA), which is another important unsupervised learning technique, with
certain similarities to EM, have been explored by Wang et al. [151].
¹⁷ http://lucene.apache.org/mahout/
CHAPTER 7
Closing Remarks
The need to process enormous quantities of data has never been greater. Not only
are terabyte- and petabyte-scale datasets rapidly becoming commonplace, but there
is consensus that great value lies buried in them, waiting to be unlocked by the right
computational tools. In the commercial sphere, business intelligence—driven by the
ability to gather data from a dizzying array of sources—promises to help organizations
better understand their customers and the marketplace, hopefully leading to better
business decisions and competitive advantages. For engineers building information pro-
cessing tools and applications, larger datasets lead to more effective algorithms for a
wide range of tasks, from machine translation to spam detection. In the natural and
physical sciences, the ability to analyze massive amounts of data may provide the key
to unlocking the secrets of the cosmos or the mysteries of life.
In the preceding chapters, we have shown how MapReduce can be exploited to
solve a variety of problems related to text processing at scales that would have been
unthinkable a few years ago. However, no tool—no matter how powerful or flexible—
can be perfectly adapted to every task, so it is only fair to discuss the limitations of
the MapReduce programming model and survey alternatives. Section 7.1 covers online
learning algorithms and Monte Carlo simulations, which are examples of algorithms
that require maintaining global state. As we have seen, this is difficult to accomplish
in MapReduce. Section 7.2 discusses alternative programming models, and the book
concludes in Section 7.3.
7.1 LIMITATIONS OF MAPREDUCE
As we have seen throughout this book, solutions to many interesting problems in text
processing do not require global synchronization. As a result, they can be expressed
naturally in MapReduce, since map and reduce tasks run independently and in iso-
lation. However, there are many examples of algorithms that depend crucially on the
existence of shared global state during processing, making them difficult to implement
in MapReduce (since the single opportunity for global synchronization in MapReduce
is the barrier between the map and reduce phases of processing).
The first example is online learning. Recall from Chapter 6 the concept of learning
as the setting of parameters in a statistical model. Both EM and the gradient-based
learning algorithms we described are instances of what are known as batch learning
algorithms. This simply means that the full “batch” of training data is processed before
any updates to the model parameters are made. On one hand, this is quite reasonable:
updates are not made until the full evidence of the training data has been weighed
against the model. An earlier update would seem, in some sense, to be hasty. However,
it is generally the case that more frequent updates can lead to more rapid convergence
of the model (in terms of number of training instances processed), even if those updates
are made by considering less data [24]. Thinking in terms of gradient optimization (see
Section 6.5), online learning algorithms can be understood as computing an approx-
imation of the true gradient, using only a few training instances. Although only an
approximation, the gradient computed from a small subset of training instances is of-
ten quite reasonable, and the aggregate behavior of multiple updates tends to even out
errors that are made. In the limit, updates can be made after every training instance.
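To illustrate the distinction, the following sketch (ours, not the book's) contrasts a single batch update, which aggregates the gradient over the entire training set before touching the parameters, with online updates made after every instance; grad_on_instance is an assumed helper returning one training example's gradient contribution.

```python
def batch_epoch(data, theta, grad_on_instance, eta=0.01):
    """One batch update: weigh all the evidence, then update once."""
    total = [0.0] * len(theta)
    for instance in data:
        g = grad_on_instance(instance, theta)
        total = [t + gi for t, gi in zip(total, g)]
    return [t - eta * g for t, g in zip(theta, total)]

def online_epoch(data, theta, grad_on_instance, eta=0.01):
    """Online updates: adjust the parameters after every training instance,
    using that instance's gradient as an approximation of the true gradient."""
    for instance in data:
        g = grad_on_instance(instance, theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta
```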
Unfortunately, implementing online learning algorithms in MapReduce is problem-
atic. The model parameters in a learning algorithm can be viewed as shared global state,
which must be updated as the model is evaluated against training data. All processes
performing the evaluation (presumably the mappers) must have access to this state.
In a batch learner, where updates occur in one or more reducers (or, alternatively, in
the driver code), synchronization of this resource is enforced by the MapReduce frame-
work. However, with online learning, these updates must occur after processing smaller
numbers of instances. This means that the framework must be altered to support faster
processing of smaller datasets, which goes against the design choices of most existing
MapReduce implementations. Since MapReduce was specifically optimized for batch
operations over large amounts of data, such a style of computation would likely result
in inefficient use of resources. In Hadoop, for example, map and reduce tasks have con-
siderable startup costs. This is acceptable because in most circumstances, this cost is
amortized over the processing of many key-value pairs. However, for small datasets,
these high startup costs become intolerable. An alternative is to abandon shared global
state and run independent instances of the training algorithm in parallel (on different
portions of the data). A final solution is then arrived at by merging individual results.
Experiments, however, show that the merged solution is inferior to the output of running
the training algorithm on the entire dataset [52].
A related difficulty occurs when running what are called Monte Carlo simula-
tions, which are used to perform inference in probabilistic models where evaluating or
representing the model exactly is impossible. The basic idea is quite simple: samples
are drawn from the random variables in the model to simulate its behavior, and then
simple frequency statistics are computed over the samples. This sort of inference is par-
ticularly useful when dealing with so-called nonparametric models, which are models
whose structure is not specified in advance, but is rather inferred from training data.
For an illustration, imagine learning a hidden Markov model, but inferring the num-
ber of states, rather than having them specified. Being able to parallelize Monte Carlo
simulations would be tremendously valuable, particularly for unsupervised learning ap-
plications where they have been found to be far more effective than EM-based learning
(which requires specifying the model). Although recent work [10] has shown that the
delays in synchronizing sample statistics due to parallel implementations do not neces-
sarily damage the inference, MapReduce offers no natural mechanism for managing the
global shared state that would be required for such an implementation.
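As a toy illustration of the basic idea (deliberately unrelated to any particular nonparametric model), the sketch below estimates a probability by drawing samples and computing a frequency statistic over them; the helper names are ours.

```python
import random

def monte_carlo_estimate(sample, event, n=100000):
    """Estimate Pr(event) by drawing n samples from the model's random
    variables and computing a simple frequency statistic over them."""
    hits = sum(1 for _ in range(n) if event(sample()))
    return hits / n

# Toy model: two fair dice; estimate the probability that their sum is at
# least 10 (exact value 6/36). Real applications instead sample from a complex
# probabilistic model for which exact inference is intractable.
estimate = monte_carlo_estimate(
    sample=lambda: (random.randint(1, 6), random.randint(1, 6)),
    event=lambda roll: roll[0] + roll[1] >= 10)
```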
The problem of global state is sufficiently pervasive that there has been substan-
tial work on solutions. One approach is to build a distributed datastore capable of
maintaining the global state. However, such a system would need to be highly scal-
able to be used in conjunction with MapReduce. Google’s BigTable [34], which is a
sparse, distributed, persistent multidimensional sorted map built on top of GFS, fits
the bill, and has been used in exactly this manner. Amazon’s Dynamo [48], which is a
distributed key-value store (with a very different architecture), might also be useful in
this respect, although it wasn’t originally designed with such an application in mind.
Unfortunately, it is unclear if the open-source implementations of these two systems
(HBase and Cassandra, respectively) are sufficiently mature to handle the low-latency
and high-throughput demands of maintaining global state in the context of massively
distributed processing (but recent benchmarks are encouraging [40]).
7.2 ALTERNATIVE COMPUTING PARADIGMS
Streaming algorithms [3] represent an alternative programming model for dealing with
large volumes of data with limited computational and storage resources. This model
assumes that data are presented to the algorithm as one or more streams of inputs that
are processed in order, and only once. The model is agnostic with respect to the source
of these streams, which could be files in a distributed file system, but more interestingly,
data from an “external” source or some other data gathering device. Stream processing
is very attractive for working with time-series data (news feeds, tweets, sensor readings,
etc.), which is difficult in MapReduce (once again, given its batch-oriented design).
Furthermore, since streaming algorithms are comparatively simple (because there is
only so much that can be done with a particular training instance), they can often take
advantage of modern GPUs, which have a large number of (relatively simple) functional
units [104]. In the context of text processing, streaming algorithms have been applied
to language modeling [90], translation modeling [89], and detecting the first mention of
a news event in a stream [121].
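A minimal sketch of the programming model (ours, with hypothetical inputs): each item in the stream is seen exactly once, in order, and the algorithm maintains only a small, fixed amount of state.

```python
def process_stream(stream):
    """Process a stream of numeric readings in one pass with constant state:
    a running count, mean, and maximum. The stream could be a file, a socket,
    or any other iterable source; items are seen in order and only once."""
    count, mean, maximum = 0, 0.0, float('-inf')
    for reading in stream:
        count += 1
        mean += (reading - mean) / count   # incremental mean update
        maximum = max(maximum, reading)
    return count, mean, maximum

# Usage with any iterable source, e.g. a generator over a sensor feed.
stats = process_stream(iter([2.5, 3.0, 1.5, 4.0]))
```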
The idea of stream processing has been generalized in the Dryad framework as
arbitrary dataflow graphs [75, 159]. A Dryad job is a directed acyclic graph where each
vertex represents developer-specified computations and edges represent data channels
that capture dependencies. The dataflow graph is a logical computation graph that is
automatically mapped onto physical resources by the framework. At runtime, channels
are used to transport partial results between vertices, and can be realized using files,
TCP pipes, or shared memory.
Another system worth mentioning is Pregel [98], which implements a program-
ming model inspired by Valiant’s Bulk Synchronous Parallel (BSP) model [148]. Pregel
was specifically designed for large-scale graph algorithms, but unfortunately there are
few published details at present. However, a longer description is anticipated in a forth-
coming paper [99].
What is the significance of these developments? The power of MapReduce derives
from providing an abstraction that allows developers to harness the power of large
clusters. As anyone who has taken an introductory computer science course would know,
abstractions manage complexity by hiding details and presenting well-defined behaviors
to users of those abstractions. This process makes certain tasks easier, but others more
difficult, if not impossible. MapReduce is certainly no exception to this generalization,
and one of the goals of this book has been to give the reader a better understanding of
what’s easy to do in MapReduce and what its limitations are. But of course, this raises
the obvious question: What other abstractions are available in the massively-distributed
datacenter environment? Are there more appropriate computational models that would
allow us to tackle classes of problems that are difficult for MapReduce?
Dryad and Pregel are alternative answers to these questions. They share in pro-
viding an abstraction for large-scale distributed computations, separating the what from
the how of computation and isolating the developer from the details of concurrent pro-
gramming. They differ, however, in how distributed computations are conceptualized:
functional-style programming, arbitrary dataflows, or BSP. These conceptions represent
different tradeoffs between simplicity and expressivity: for example, Dryad is more flex-
ible than MapReduce, and in fact, MapReduce can be trivially implemented in Dryad.
However, it remains unclear, at least at present, which approach is more appropriate
for different classes of applications. Looking forward, we can certainly expect the de-
velopment of new models and a better understanding of existing ones. MapReduce is
not the end, and perhaps not even the best. It is merely the first of many approaches
to harness large-scale distributed computing resources.
Even within the Hadoop/MapReduce ecosystem, we have already observed the
development of alternative approaches for expressing distributed computations. For
example, there is a proposal to add a third merge phase after map and reduce to
better support relational operations [36]. Pig [114], which was inspired by Google’s
Sawzall [122], can be described as a data analytics platform that provides a lightweight
scripting language for manipulating large datasets. Although Pig scripts (in a language
called Pig Latin) are ultimately converted into Hadoop jobs by Pig’s execution engine,
constructs in the language allow developers to specify data transformations (filtering,
joining, grouping, etc.) at a much higher level. Similarly, Hive [68], another open-source
project, provides an abstraction on top of Hadoop that allows users to issue SQL queries
against large relational datasets stored in HDFS. Hive queries (in HiveQL) are “compiled
down” to Hadoop jobs by the Hive query engine. Therefore, the system provides a data
analysis tool for users who are already comfortable with relational databases, while
simultaneously taking advantage of Hadoop’s data processing capabilities.
7.3 MAPREDUCE AND BEYOND
The capabilities necessary to tackle large-data problems are already within reach by
many and will continue to become more accessible over time. By scaling “out” with
commodity servers, we have been able to economically bring large clusters of machines
to bear on problems of interest. But this has only been possible with corresponding
innovations in software and how computations are organized on a massive scale. Impor-
tant ideas include moving processing to the data, as opposed to the other way around,
and emphasizing throughput over latency for batch tasks via sequential scans through
data rather than random seeks. Most important of all, however, is the development of
new abstractions that hide system-level details from the application developer. These
abstractions are at the level of entire datacenters, and provide a model with which pro-
grammers can reason about computations at a massive scale without being distracted
by fine-grained concurrency management, fault tolerance, error recovery, and a host of
other issues in distributed computing. This, in turn, paves the way for innovations in
scalable algorithms that can run on petabyte-scale datasets.
None of these points are new or particularly earth-shattering—computer scientists
have known about these principles for decades. However, MapReduce is unique in that,
for the first time, all these ideas came together and were demonstrated on practical
problems at scales unseen before, both in terms of computational resources and the
impact on the daily lives of millions. The engineers at Google deserve a tremendous
amount of credit for that, and also for sharing their insights with the rest of the world.
Furthermore, the engineers and executives at Yahoo deserve a lot of credit for starting
the open-source Hadoop project, which has made MapReduce accessible to everyone
and created the vibrant software ecosystem that flourishes today. Add to that the
advent of utility computing, which eliminates capital investments associated with cluster
infrastructure, and large-data processing capabilities are now available “to the masses” with
a relatively low barrier to entry.
The golden age of massively distributed computing is finally upon us.
Bibliography
[1] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and
Alexander Rasin. HadoopDB: An architectural hybrid of MapReduce and DBMS
technologies for analytical workloads. In Proceedings of the 35th International
Conference on Very Large Data Base (VLDB 2009), pages 922–933, Lyon, France,
2009.
[2] Réka Albert and Albert-László Barabási. Statistical mechanics of complex net-
works. Reviews of Modern Physics, 74:47–97, 2002.
[3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approxi-
mating the frequency moments. In Proceedings of the 28th Annual ACM Sympo-
sium on Theory of Computing (STOC ’96), pages 20–29, Philadelphia, Pennsyl-
vania, 1996.
[4] Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Heller-
stein, and Russell C. Sears. BOOM: Data-centric programming in the datacenter.
Technical Report UCB/EECS-2009-98, Electrical Engineering and Computer Sci-
ences, University of California at Berkeley, 2009.
[5] Gene Amdahl. Validity of the single processor approach to achieving large-scale
computing capabilities. In Proceedings of the AFIPS Spring Joint Computer Con-
ference, pages 483–485, 1967.
[6] Rajagopal Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu
Pucha, Prasenjit Sarkar, Mansi Shah, and Renu Tewari. Cloud analytics: Do
we really need to reinvent the storage stack? In Proceedings of the 2009 Workshop
on Hot Topics in Cloud Computing (HotCloud 09), San Diego, California, 2009.
[7] Thomas Anderson, Michael Dahlin, Jeanna Neefe, David Patterson, Drew Roselli,
and Randolph Wang. Serverless network file systems. In Proceedings of the 15th
ACM Symposium on Operating Systems Principles (SOSP 1995), pages 109–126,
Copper Mountain Resort, Colorado, 1995.
[8] Vo Ngoc Anh and Alistair Moffat. Inverted index compression using word-aligned
binary codes. Information Retrieval, 8(1):151–166, 2005.
[9] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H.
Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion
Stoica, and Matei Zaharia. Above the clouds: A Berkeley view of cloud comput-
ing. Technical Report UCB/EECS-2009-28, Electrical Engineering and Computer
Sciences, University of California at Berkeley, 2009.
[10] Arthur Asuncion, Padhraic Smyth, and Max Welling. Asynchronous distributed
learning of topic models. In Advances in Neural Information Processing Systems
21 (NIPS 2008), pages 81–88, Vancouver, British Columbia, Canada, 2008.
[11] Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, and
Fabrizio Silvestri. Challenges on distributed web retrieval. In Proceedings of the
IEEE 23rd International Conference on Data Engineering (ICDE 2007), pages
6–20, Istanbul, Turkey, 2007.
[12] Ricardo Baeza-Yates, Carlos Castillo, and Vicente López. PageRank increase un-
der different collusion topologies. In Proceedings of the First International Work-
shop on Adversarial Information Retrieval on the Web (AIRWeb 2005), pages
17–24, Chiba, Japan, 2005.
[13] Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vas-
silis Plachouras, and Fabrizio Silvestri. The impact of caching on search engines.
In Proceedings of the 30th Annual International ACM SIGIR Conference on Re-
search and Development in Information Retrieval (SIGIR 2007), pages 183–190,
Amsterdam, The Netherlands, 2007.
[14] Michele Banko and Eric Brill. Scaling to very very large corpora for natural
language disambiguation. In Proceedings of the 39th Annual Meeting of the Asso-
ciation for Computational Linguistics (ACL 2001), pages 26–33, Toulouse, France,
2001.
[15] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho,
Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualiza-
tion. In Proceedings of the 19th ACM Symposium on Operating Systems Principles
(SOSP 2003), pages 164–177, Bolton Landing, New York, 2003.
[16] Luiz André Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: The
Google cluster architecture. IEEE Micro, 23(2):22–28, 2003.
[17] Luiz André Barroso and Urs Hölzle. The case for energy-proportional computing.
Computer, 40(12):33–37, 2007.
[18] Luiz André Barroso and Urs Hölzle. The Datacenter as a Computer: An Introduc-
tion to the Design of Warehouse-Scale Machines. Morgan & Claypool Publishers,
2009.
[19] Jacek Becla, Andrew Hanushevsky, Sergei Nikolaev, Ghaleb Abdulla, Alex Szalay,
Maria Nieto-Santisteban, Ani Thakar, and Jim Gray. Designing a multi-petabyte
database for LSST. SLAC Publications SLAC-PUB-12292, Stanford Linear Ac-
celerator Center, May 2006.
[20] Jacek Becla and Daniel L. Wang. Lessons learned from managing a petabyte.
In Proceedings of the Second Biennial Conference on Innovative Data Systems
Research (CIDR 2005), Asilomar, California, 2005.
[21] Gordon Bell, Tony Hey, and Alex Szalay. Beyond the data deluge. Science,
323(5919):1297–1298, 2009.
[22] Monica Bianchini, Marco Gori, and Franco Scarselli. Inside PageRank. ACM
Transactions on Internet Technology, 5(1):92–128, 2005.
[23] Jorge Luis Borges. Collected Fictions (translated by Andrew Hurley). Penguin,
1999.
[24] Léon Bottou. Stochastic learning. In Olivier Bousquet and Ulrike von Luxburg,
editors, Advanced Lectures on Machine Learning, Lecture Notes in Artificial In-
telligence, LNAI 3176, pages 146–168. Springer Verlag, Berlin, 2004.
[25] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean.
Large language models in machine translation. In Proceedings of the 2007 Joint
Conference on Empirical Methods in Natural Language Processing and Computa-
tional Natural Language Learning, pages 858–867, Prague, Czech Republic, 2007.
[26] Thorsten Brants and Peng Xu. Distributed Language Models. Morgan & Claypool
Publishers, 2010.
[27] Eric Brill, Jimmy Lin, Michele Banko, Susan Dumais, and Andrew Ng. Data-
intensive question answering. In Proceedings of the Tenth Text REtrieval Confer-
ence (TREC 2001), pages 393–400, Gaithersburg, Maryland, 2001.
[28] Frederick P. Brooks. The Mythical Man-Month: Essays on Software Engineering,
Anniversary Edition. Addison-Wesley, Reading, Massachusetts, 1995.
[29] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L.
Mercer. The mathematics of statistical machine translation: Parameter estima-
tion. Computational Linguistics, 19(2):263–311, 1993.
[30] Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information
Retrieval: Implementing and Evaluating Search Engines. MIT Press, Cambridge,
Massachusetts, 2010.
[31] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, and Ivona
Brandic. Cloud computing and emerging IT platforms: Vision, hype, and reality
for delivering computing as the 5th utility. Future Generation Computer Systems,
25(6):599–616, 2009.
[32] Luis-Felipe Cabrera and Darrell D. E. Long. Swift: Using distributed disk striping
to provide high I/O data rates. Computer Systems, 4(4):405–436, 1991.
[33] Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. Find-
ings of the 2009 workshop on statistical machine translation. In Proceedings of the
Fourth Workshop on Statistical Machine Translation (StatMT ’09), pages 1–28,
Athens, Greece, 2009.
[34] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wal-
lach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber.
Bigtable: A distributed storage system for structured data. In Proceedings of the
7th Symposium on Operating System Design and Implementation (OSDI 2006),
pages 205–218, Seattle, Washington, 2006.
[35] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing tech-
niques for language modeling. In Proceedings of the 34th Annual Meeting of
the Association for Computational Linguistics (ACL 1996), pages 310–318, Santa
Cruz, California, 1996.
[36] Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. Map-
Reduce-Merge: Simplified relational data processing on large clusters. In Pro-
ceedings of the 2007 ACM SIGMOD International Conference on Management of
Data, pages 1029–1040, Beijing, China, 2007.
[37] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, An-
drew Ng, and Kunle Olukotun. Map-Reduce for machine learning on multicore.
In Advances in Neural Information Processing Systems 19 (NIPS 2006), pages
281–288, Vancouver, British Columbia, Canada, 2006.
[38] Kenneth W. Church and Patrick Hanks. Word association norms, mutual infor-
mation, and lexicography. Computational Linguistics, 16(1):22–29, 1990.
[39] Jonathan Cohen. Graph twiddling in a MapReduce world. Computing in Science
and Engineering, 11(4):29–41, 2009.
[40] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Rus-
sell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the
First ACM Symposium on Cloud Computing (ACM SOCC 2010), Indianapolis,
Indiana, 2010.
[41] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to
Algorithms. MIT Press, Cambridge, Massachusetts, 1990.
[42] W. Bruce Croft, Donald Metzler, and Trevor Strohman. Search Engines: Infor-
mation Retrieval in Practice. Addison-Wesley, Reading, Massachusetts, 2009.
[43] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser,
Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards
a realistic model of parallel computation. ACM SIGPLAN Notices, 28(7):1–12,
1993.
[44] Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun. A practical
part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural
Language Processing, pages 133–140, Trento, Italy, 1992.
[45] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on
large clusters. In Proceedings of the 6th Symposium on Operating System Design
and Implementation (OSDI 2004), pages 137–150, San Francisco, California, 2004.
[46] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on
large clusters. Communications of the ACM, 51(1):107–113, 2008.
[47] Jeffrey Dean and Sanjay Ghemawat. MapReduce: A flexible data processing tool.
Communications of the ACM, 53(1):72–77, 2010.
[48] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,
Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall, and
Werner Vogels. Dynamo: Amazon’s highly available key-value store. In Proceed-
ings of the 21st ACM Symposium on Operating Systems Principles (SOSP 2007),
pages 205–220, Stevenson, Washington, 2007.
[49] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood
from incomplete data via the EM algorithm. Journal of the Royal Statistical
Society. Series B (Methodological), 39(1):1–38, 1977.
[50] David J. DeWitt and Jim Gray. Parallel database systems: The future of high
performance database systems. Communications of the ACM, 35(6):85–98, 1992.
[51] David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R.
Stonebraker, and David Wood. Implementation techniques for main memory
database systems. ACM SIGMOD Record, 14(2):1–8, 1984.
[52] Mark Dredze, Alex Kulesza, and Koby Crammer. Multi-domain learning by
confidence-weighted parameter combination. Machine Learning, 79:123–149, 2010.
[53] Susan Dumais, Michele Banko, Eric Brill, Jimmy Lin, and Andrew Ng. Web
question answering: Is more always better? In Proceedings of the 25th Annual
International ACM SIGIR Conference on Research and Development in Informa-
tion Retrieval (SIGIR 2002), pages 291–298, Tampere, Finland, 2002.
[54] Chris Dyer, Aaron Cordova, Alex Mont, and Jimmy Lin. Fast, easy, and cheap:
Construction of statistical machine translation models with MapReduce. In Pro-
ceedings of the Third Workshop on Statistical Machine Translation at ACL 2008,
pages 199–207, Columbus, Ohio, 2008.
[55] John R. Firth. A synopsis of linguistic theory 1930–55. In Studies in Linguis-
tic Analysis, Special Volume of the Philological Society, pages 1–32. Blackwell,
Oxford, 1957.
[56] Qin Gao and Stephan Vogel. Training phrase-based machine translation models on
the cloud: Open source machine translation toolkit Chaski. The Prague Bulletin
of Mathematical Linguistics, 93:37–46, 2010.
[57] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File Sys-
tem. In Proceedings of the 19th ACM Symposium on Operating Systems Principles
(SOSP 2003), pages 29–43, Bolton Landing, New York, 2003.
[58] Seth Gilbert and Nancy Lynch. Brewer’s Conjecture and the feasibility of consis-
tent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59,
2002.
[59] Michelle Girvan and Mark E. J. Newman. Community structure in social and
biological networks. Proceedings of the National Academy of Science, 99(12):7821–
7826, 2002.
[60] Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction
to Parallel Computing. Addison-Wesley, Reading, Massachusetts, 2003.
[61] Mark S. Granovetter. The strength of weak ties. The American Journal of Soci-
ology, 78(6):1360–1380, 1973.
[62] Mark S. Granovetter. The strength of weak ties: A network theory revisited.
Sociological Theory, 1:201–233, 1983.
[63] Zoltán Gyöngyi and Hector Garcia-Molina. Web spam taxonomy. In Proceedings
of the First International Workshop on Adversarial Information Retrieval on the
Web (AIRWeb 2005), pages 39–47, Chiba, Japan, 2005.
[64] Per Hage and Frank Harary. Island Networks: Communication, Kinship, and
Classification Structures in Oceania. Cambridge University Press, Cambridge,
England, 1996.
[65] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness
of data. IEEE Intelligent Systems, 24(2):8–12, 2009.
[66] James Hamilton. On designing and deploying Internet-scale services. In Proceed-
ings of the 21st Large Installation System Administration Conference (LISA ’07),
pages 233–244, Dallas, Texas, 2007.
[67] James Hamilton. Cooperative Expendable Micro-Slice Servers (CEMS): Low cost,
low power servers for Internet-scale services. In Proceedings of the Fourth Bien-
nial Conference on Innovative Data Systems Research (CIDR 2009), Asilomar,
California, 2009.
[68] Jeff Hammerbacher. Information platforms and the rise of the data scientist.
In Toby Segaran and Jeff Hammerbacher, editors, Beautiful Data, pages 73–84.
O’Reilly, Sebastopol, California, 2009.
[69] Zelig S. Harris. Mathematical Structures of Language. Wiley, New York, 1968.
[70] Md. Rafiul Hassan and Baikunth Nath. Stock market forecasting using hidden
Markov models: A new approach. In Proceedings of the 5th International Confer-
ence on Intelligent Systems Design and Applications (ISDA ’05), pages 192–196,
Wroclaw, Poland, 2005.
[71] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong
Wang. Mars: A MapReduce framework on graphics processors. In Proceedings
of the 17th International Conference on Parallel Architectures and Compilation
Techniques (PACT 2008), pages 260–269, Toronto, Ontario, Canada, 2008.
[72] Tony Hey, Stewart Tansley, and Kristin Tolle. The Fourth Paradigm: Data-
Intensive Scientific Discovery. Microsoft Research, Redmond, Washington, 2009.
[73] Tony Hey, Stewart Tansley, and Kristin Tolle. Jim Gray on eScience: A trans-
formed scientific method. In Tony Hey, Stewart Tansley, and Kristin Tolle, editors,
The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research,
Redmond, Washington, 2009.
[74] John Howard, Michael Kazar, Sherri Menees, David Nichols, Mahadev Satya-
narayanan, Robert Sidebotham, and Michael West. Scale and performance in
a distributed file system. ACM Transactions on Computer Systems, 6(1):51–81,
1988.
[75] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad:
Distributed data-parallel programs from sequential building blocks. In Proceedings
of the ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
(EuroSys 2007), pages 59–72, Lisbon, Portugal, 2007.
[76] Adam Jacobs. The pathologies of big data. ACM Queue, 7(6), 2009.
[77] Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley, Reading,
Massachusetts, 1992.
[78] Frederick Jelinek. Statistical methods for speech recognition. MIT Press, Cam-
bridge, Massachusetts, 1997.
[79] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Pearson,
Upper Saddle River, New Jersey, 2009.
[80] U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, and
Jure Leskovec. HADI: Fast diameter estimation and mining in massive graphs
with Hadoop. Technical Report CMU-ML-08-117, School of Computer Science,
Carnegie Mellon University, 2008.
[81] U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. PEGASUS: A
peta-scale graph mining system—implementation and observations. In Proceed-
ings of the 2009 Ninth IEEE International Conference on Data Mining (ICDM
2009), pages 229–238, Miami, Florida, 2009.
[82] Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation
for MapReduce. In Proceedings of the 21st Annual ACM-SIAM Symposium on
Discrete Algorithms (SODA 2010), Austin, Texas, 2010.
[83] Aaron Kimball, Sierra Michels-Slettvet, and Christophe Bisciglia. Cluster com-
puting for Web-scale data processing. In Proceedings of the 39th ACM Techni-
cal Symposium on Computer Science Education (SIGCSE 2008), pages 116–120,
Portland, Oregon, 2008.
[84] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal
of the ACM, 46(5):604–632, 1999.
[85] Philipp Koehn. Statistical Machine Translation. Cambridge University Press,
Cambridge, England, 2010.
[86] Philipp Koehn, Franz J. Och, and Daniel Marcu. Statistical phrase-based trans-
lation. In Proceedings of the 2003 Human Language Technology Conference of
the North American Chapter of the Association for Computational Linguistics
(HLT/NAACL 2003), pages 48–54, Edmonton, Alberta, Canada, 2003.
[87] John D. Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random
fields: Probabilistic models for segmenting and labeling sequence data. In Pro-
ceedings of the Eighteenth International Conference on Machine Learning (ICML
’01), pages 282–289, San Francisco, California, 2001.
[88] Ronny Lempel and Shlomo Moran. SALSA: The Stochastic Approach for Link-
Structure Analysis. ACM Transactions on Information Systems, 19(2):131–160,
2001.
[89] Abby Levenberg, Chris Callison-Burch, and Miles Osborne. Stream-based transla-
tion models for statistical machine translation. In Proceedings of the 11th Annual
Conference of the North American Chapter of the Association for Computational
Linguistics (NAACL HLT 2010), Los Angeles, California, 2010.
[90] Abby Levenberg and Miles Osborne. Stream-based randomised language models
for SMT. In Proceedings of the 2009 Conference on Empirical Methods in Natural
Language Processing, pages 756–764, Singapore, 2009.
[91] Adam Leventhal. Triple-parity RAID and beyond. ACM Queue, 7(11), 2009.
[92] Jimmy Lin. An exploration of the principles underlying redundancy-based factoid
question answering. ACM Transactions on Information Systems, 27(2):1–55, 2007.
[93] Jimmy Lin. Exploring large-data issues in the curriculum: A case study with
MapReduce. In Proceedings of the Third Workshop on Issues in Teaching Com-
putational Linguistics (TeachCL-08) at ACL 2008, pages 54–61, Columbus, Ohio,
2008.
[94] Jimmy Lin. Scalable language processing algorithms for the masses: A case study
in computing word co-occurrence matrices with MapReduce. In Proceedings of the
2008 Conference on Empirical Methods in Natural Language Processing (EMNLP
2008), pages 419–428, Honolulu, Hawaii, 2008.
[95] Jimmy Lin, Anand Bahety, Shravya Konda, and Samantha Mahindrakar. Low-
latency, high-throughput access to static global resources within the Hadoop
framework. Technical Report HCIL-2009-01, University of Maryland, College
Park, Maryland, January 2009.
[96] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large
scale optimization. Mathematical Programming B, 45(3):503–528, 1989.
[97] Adam Lopez. Statistical machine translation. ACM Computing Surveys, 40(3):1–
49, 2008.
[98] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan
Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale
graph processing. In Proceedings of the 28th ACM Symposium on Principles of
Distributed Computing (PODC 2009), page 6, Calgary, Alberta, Canada, 2009.
[99] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert,
Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-
scale graph processing. In Proceedings of the 2010 ACM SIGMOD International
Conference on Management of Data, Indianapolis, Indiana, 2010.
[100] Robert Malouf. A comparison of algorithms for maximum entropy parameter
estimation. In Proceedings of the Sixth Conference on Natural Language Learning
(CoNLL-2002), pages 49–55, Taipei, Taiwan, 2002.
[101] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. An Introduc-
tion to Information Retrieval. Cambridge University Press, Cambridge, England,
2008.
[102] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural
Language Processing. MIT Press, Cambridge, Massachusetts, 1999.
[103] Elaine R. Mardis. The impact of next-generation sequencing technology on ge-
netics. Trends in Genetics, 24(3):133–141, 2008.
[104] Michael D. McCool. Scalable programming models for massively multicore pro-
cessors. Proceedings of the IEEE, 96(5):816–831, 2008.
[105] Marshall K. McKusick and Sean Quinlan. GFS: Evolution on fast-forward. ACM
Queue, 7(7), 2009.
[106] John Mellor-Crummey, David Whalley, and Ken Kennedy. Improving memory
hierarchy performance for irregular applications using data and computation re-
orderings. International Journal of Parallel Programming, 29(3):217–247, 2001.
[107] Donald Metzler, Jasmine Novak, Hang Cui, and Srihari Reddy. Building enriched
document representations using aggregated anchor text. In Proceedings of the
32nd Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval (SIGIR 2009), pages 219–226, 2009.
[108] David R. H. Miller, Tim Leek, and Richard M. Schwartz. A hidden Markov model
information retrieval system. In Proceedings of the 22nd Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval
(SIGIR 1999), pages 214–221, Berkeley, California, 1999.
[109] Alistair Moffat, William Webber, and Justin Zobel. Load balancing for term-
distributed parallel retrieval. In Proceedings of the 29th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval
(SIGIR 2006), pages 348–355, Seattle, Washington, 2006.
[110] Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy
for text classification. In Proceedings of the IJCAI-99 Workshop on Machine
Learning for Information Filtering, pages 61–67, Stockholm, Sweden, 1999.
[111] Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman,
Lamia Youseff, and Dmitrii Zagorodnov. The Eucalyptus open-source cloud-
computing system. In Proceedings of the 9th IEEE/ACM International Sym-
posium on Cluster Computing and the Grid, pages 124–131, Washington, D.C.,
2009.
[112] Franz J. Och and Hermann Ney. A systematic comparison of various statistical
alignment models. Computational Linguistics, 29(1):19–51, 2003.
[113] Christopher Olston and Marc Najork. Web crawling. Foundations and Trends in
Information Retrieval, 4(3):175–246, 2010.
[114] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and An-
drew Tomkins. Pig Latin: A not-so-foreign language for data processing. In Pro-
ceedings of the 2008 ACM SIGMOD International Conference on Management of
Data, pages 1099–1110, Vancouver, British Columbia, Canada, 2008.
[115] Kunle Olukotun and Lance Hammond. The future of microprocessors. ACM
Queue, 3(7):27–34, 2005.
[116] Sean Owen and Robin Anil. Mahout in Action. Manning Publications Co., Green-
wich, Connecticut, 2010.
[117] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The Page-
Rank citation ranking: Bringing order to the Web. Stanford Digital Library Work-
ing Paper SIDL-WP-1999-0120, Stanford University, 1999.
[118] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations
and Trends in Information Retrieval, 2(1–2):1–135, 2008.
[119] David A. Patterson. The data center is the computer. Communications of the
ACM, 52(1):105, 2008.
[120] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. De-
Witt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to
large-scale data analysis. In Proceedings of the 35th ACM SIGMOD International
Conference on Management of Data, pages 165–178, Providence, Rhode Island,
2009.
[121] Sasa Petrovic, Miles Osborne, and Victor Lavrenko. Streaming first story detec-
tion with application to Twitter. In Proceedings of the 11th Annual Conference
of the North American Chapter of the Association for Computational Linguistics
(NAACL HLT 2010), Los Angeles, California, 2010.
[122] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the
data: Parallel analysis with Sawzall. Scientific Programming Journal, 13(4):277–
298, 2005.
[123] Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. Failure trends
in a large disk drive population. In Proceedings of the 5th USENIX Conference
on File and Storage Technologies (FAST 2007), San Jose, California, 2008.
[124] Xiaoguang Qi and Brian D. Davison. Web page classification: Features and algo-
rithms. ACM Computing Surveys, 41(2), 2009.
[125] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected appli-
cations in speech recognition. In Readings in Speech Recognition, pages 267–296.
Morgan Kaufmann Publishers, San Francisco, California, 1990.
[126] M. Mustafa Rafique, Benjamin Rose, Ali R. Butt, and Dimitrios S. Nikolopou-
los. Supporting MapReduce on large-scale asymmetric multi-core clusters. ACM
Operating Systems Review, 43(2):25–34, 2009.
[127] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Chris-
tos Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems.
In Proceedings of the 13th International Symposium on High-Performance Com-
puter Architecture (HPCA 2007), pages 205–218, Phoenix, Arizona, 2007.
[128] Delip Rao and David Yarowsky. Ranking and semi-supervised classification
on large scale graphs using Map-Reduce. In Proceedings of the ACL/IJCNLP
2009 Workshop on Graph-Based Methods for Natural Language Processing
(TextGraphs-4), Singapore, 2009.
[129] Michael A. Rappa. The utility business model and the future of computing ser-
vices. IBM Systems Journal, 34(1):32–42, 2004.
[130] Sheldon M. Ross. Stochastic Processes. Wiley, New York, 1996.
[131] Thomas Sandholm and Kevin Lai. MapReduce optimization using regulated dy-
namic prioritization. In Proceedings of the Eleventh International Joint Confer-
ence on Measurement and Modeling of Computer Systems (SIGMETRICS ’09),
pages 299–310, Seattle, Washington, 2009.
[132] Michael Schatz. High Performance Computing for DNA Sequence Alignment and
Assembly. PhD thesis, University of Maryland, College Park, 2010.
[133] Frank Schmuck and Roger Haskin. GPFS: A shared-disk file system for large
computing clusters. In Proceedings of the First USENIX Conference on File and
Storage Technologies, pages 231–244, Monterey, California, 2002.
[134] Donovan A. Schneider and David J. DeWitt. A performance evaluation of four
parallel join algorithms in a shared-nothing multiprocessor environment. In Pro-
ceedings of the 1989 ACM SIGMOD International Conference on Management of
Data, pages 110–121, Portland, Oregon, 1989.
[135] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM errors
in the wild: A large-scale field study. In Proceedings of the Eleventh Interna-
tional Joint Conference on Measurement and Modeling of Computer Systems
(SIGMETRICS ’09), pages 193–204, Seattle, Washington, 2009.
[136] Hinrich Schütze. Automatic word sense discrimination. Computational Linguis-
tics, 24(1):97–123, 1998.
[137] Hinrich Schütze and Jan O. Pedersen. A cooccurrence-based thesaurus and two
applications to information retrieval. Information Processing and Management,
33(3):307–318, 1998.
[138] Satoshi Sekine and Elisabete Ranchhod. Named Entities: Recognition, Classifica-
tion and Use. John Benjamins, Amsterdam, The Netherlands, 2009.
[139] Kristie Seymore, Andrew McCallum, and Ronald Rosenfeld. Learning hidden
Markov model structure for information extraction. In Proceedings of the AAAI-99
Workshop on Machine Learning for Information Extraction, pages 37–42, Or-
lando, Florida, 1999.
[140] Fei Sha and Fernando Pereira. Shallow parsing with conditional random
fields. In Proceedings of the 2003 Human Language Technology Conference of
the North American Chapter of the Association for Computational Linguistics
(HLT/NAACL 2003), pages 134–141, Edmonton, Alberta, Canada, 2003.
[141] Noah Smith. Log-linear models. http://www.cs.cmu.edu/~nasmith/papers/smith.tut04.pdf, 2004.
[142] Christopher Southan and Graham Cameron. Beyond the tsunami: Developing the
infrastructure to deal with life sciences data. In Tony Hey, Stewart Tansley, and
Kristin Tolle, editors, The Fourth Paradigm: Data-Intensive Scientific Discovery.
Microsoft Research, Redmond, Washington, 2009.
[143] Mario Stanke and Stephan Waack. Gene prediction with a hidden Markov model
and a new intron submodel. Bioinformatics, 19(Suppl. 2):ii215–ii225, October 2003.
[144] Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson,
Andrew Pavlo, and Alexander Rasin. MapReduce and parallel DBMSs: Friends
or foes? Communications of the ACM, 53(1):64–71, 2010.
[145] Alexander S. Szalay, Peter Z. Kunszt, Ani Thakar, Jim Gray, Don Slutz, and
Robert J. Brunner. Designing and mining multi-terabyte astronomy archives:
The Sloan Digital Sky Survey. SIGMOD Record, 29(2):451–462, 2000.
[146] Wittawat Tantisiriroj, Swapnil Patil, and Garth Gibson. Data-intensive file sys-
tems for Internet services: A rose by any other name. . . . Technical Report CMU-
PDL-08-114, Parallel Data Laboratory, Carnegie Mellon University, 2008.
[147] Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. Frangipani: A
scalable distributed file system. In Proceedings of the 16th ACM Symposium on
Operating Systems Principles (SOSP 1997), pages 224–237, Saint-Malo, France,
1997.
[148] Leslie G. Valiant. A bridging model for parallel computation. Communications
of the ACM, 33(8):103–111, 1990.
[149] Luis M. Vaquero, Luis Rodero-Merino, Juan Caceres, and Maik Lindner. A break
in the clouds: Towards a cloud definition. ACM SIGCOMM Computer Commu-
nication Review, 39(1):50–55, 2009.
[150] Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word align-
ment in statistical translation. In Proceedings of the 16th International Confer-
ence on Computational Linguistics (COLING 1996), pages 836–841, Copenhagen,
Denmark, 1996.
[151] Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang.
PLDA: Parallel latent Dirichlet allocation for large-scale applications. In Proceed-
ings of the Fifth International Conference on Algorithmic Aspects in Information
and Management (AAIM 2009), pages 301–314, San Francisco, California, 2009.
[152] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’
networks. Nature, 393:440–442, 1998.
[153] Xingzhi Wen and Uzi Vishkin. FPGA-based prototype of a PRAM-On-Chip
processor. In Proceedings of the 5th Conference on Computing Frontiers, pages
55–66, Ischia, Italy, 2008.
[154] Tom White. Hadoop: The Definitive Guide. O’Reilly, Sebastopol, California, 2009.
[155] Eugene Wigner. The unreasonable effectiveness of mathematics in the natural
sciences. Communications on Pure and Applied Mathematics, 13(1):1–14, 1960.
[156] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Com-
pressing and Indexing Documents and Images. Morgan Kaufmann Publishing,
San Francisco, California, 1999.
[157] Jinxi Xu and W. Bruce Croft. Corpus-based stemming using cooccurrence of word
variants. ACM Transactions on Information Systems, 16(1):61–81, 1998.
[158] Rui Xu and Donald Wunsch II. Survey of clustering algorithms. IEEE Transac-
tions on Neural Networks, 16(3):645–678, 2005.
[159] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson,
Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A system for general-
purpose distributed data-parallel computing using a high-level language. In Pro-
ceedings of the 8th Symposium on Operating System Design and Implementation
(OSDI 2008), pages 1–14, San Diego, California, 2008.
[160] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott
Shenker, and Ion Stoica. Job scheduling for multi-user MapReduce clusters. Tech-
nical Report UCB/EECS-2009-55, Electrical Engineering and Computer Sciences,
University of California at Berkeley, 2009.
[161] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Sto-
ica. Improving MapReduce performance in heterogeneous environments. In Pro-
ceedings of the 8th Symposium on Operating System Design and Implementation
(OSDI 2008), pages 29–42, San Diego, California, 2008.
[162] Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM
Computing Surveys, 38(6):1–56, 2006.


data mining. analyzing. this means scaling up to the web. how many companies do you know that start their pitch with “we’re going to harvest information on the web and. Knowing what users look at. Modern information societies are defined by vast repositories of data. whose development was led by Yahoo (now an Apache project). This observation applies not only to well-established internet companies. Given this focus. this is known as business intelligence. Today. which encompasses a wide range of technologies including data warehousing. what they click on. searching. monitoring. as there is a broadly-held belief that great value lies in insights derived from mining such data. or organizing web content must tackle large-data problems: “web-scale” processing is practically synonymous with data-intensive processing. Broadly. . This book is about scalable approaches to processing large amounts of text with MapReduce. it makes sense to start with the most basic question: Why? There are many answers to this question. leads to better business decisions and competitive advantages. . but also countless startups and niche players as well. In fact. Therefore. Second. a vibrant software ecosystem has sprung up around Hadoop. For many. ”? Another strong area of growth is the analysis of user behavior data. First. across a wide range of text processing applications. logging user behavior generates so much data that many organizations simply can’t cope with the volume. with significant activity in both industry and academia. Any organization built around gathering. It was originally developed by Google and built on well-known principles in parallel and distributed processing dating back several decades. and either turn the functionality off or throw away data after some time. . “big data” is a fact of the world. MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop. and therefore an issue that real-world systems must grapple with. and thus it makes sense to take advantage of the plentiful amounts of data that surround us. more data translates into more effective algorithms. Any operator of a moderately successful website can record user activity and in a matter of weeks (or sooner) be drowning in a torrent of log data. Just think. but we focus on two. etc. filtering. any practical application must be able to scale up to datasets of interest. This represents lost opportunities.1 CHAPTER 1 Introduction MapReduce [45] is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers. and analytics. both public and private. how much time they spend on a web page. or at least a non-trivial fraction thereof.

the Large Hadron Collider (LHC) near Geneva is the world’s largest particle accelerator.web. Given the tendency for individuals and organizations to continuously fill up whatever capacity is available. perhaps 50×.cern. and in the case of bandwidth. On the other hand. INTRODUCTION How much data are we talking about? A few examples: Google grew from processing 100 TB of data a day with MapReduce in 2004 [45] to processing 20 PB a day with MapReduce in 2008 [46]. Petabyte datasets are rapidly becoming the norm.com/2009/04/30/ebays-two-enormous-data-warehouses/ 2 http://www. designed to probe the mysteries of the universe. growing at about 15 terabytes per day. Today. For example: • The high-energy physics community was already describing experiences with petabyte-scale databases back in 2005 [20].html . • The advent of next-generation DNA sequencing technology has created a deluge of sequence data that needs to be stored. Facebook revealed2 similarly impressive numbers. Moving beyond the commercial sphere. Shortly thereafter. perhaps 2× improvement during the last quarter century. by recreating conditions shortly following the Big Bang. and the other with 6.dbms2. latency and bandwidth have improved relatively little: in the case of latency. and the trends are clear: our ability to store data is fast overwhelming our ability to process what we store.ch/public/en/LHC/Computing-en.2 CHAPTER 1.5 petabytes of user data. the LHC will produce roughly 15 petabytes of data a year. boasting of 2. When it becomes fully operational. In April 2009.3 • Astronomers have long recognized the importance of a “digital observatory” that would support the data needs of researchers across the globe—the Sloan Digital Sky Survey [145] is perhaps the most well known of these projects. Looking into the future.2 gigapixel primary camera will produce approximately half a petabyte of archive images every month [19]. including the fundamental nature of matter. When the telescope comes online around 2015 in Chile.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new records per day.com/2009/05/11/facebook-hadoop-and-hive/ 3 http://public. many have recognized the importance of data management in many scientific disciplines.dbms2. increases in capacity are outpacing improvements in bandwidth such that our ability to even read back what we store is deteriorating [91]. Disk capacities have grown from tens of megabytes in the mid-1980s to about a couple of terabytes today (several orders of magnitude). and delivered to scientists for 1 http://www. a blog post1 was written about eBay’s two enormous data warehouses: one with 2 petabytes of user data. its 3. More distressing. where petabyte-scale datasets are also becoming increasingly common [21]. organized. large-data problems are growing increasingly severe. the Large Synoptic Survey Telescope (LSST) is a wide-field instrument that is capable of observing the entire sky every few days.

particularly computer science.3 further study. a task known as sentiment analysis or opinion mining [118].. such as the words and sequences of words themselves.5 petabytes in 2008 to 5 petabytes in 2009 [142]. Aspects of the representations of the data are called features. or “deep” and more difficult to extract. Another example is to automatically situate texts along a scale of “happiness”. but touches on other types of data as well (e. explore. where interventions can be specifically targeted for an individual. and simulations). the impact of this technology is nothing less than transformative [103]. The problems and solutions we discuss mostly fall into the disciplinary boundaries of natural language processing (NLP) and information retrieval (IR). not merely a luxury or curiosity. is to sort text into categories. Increasingly. One common application. classification. typically involving algorithms that attempt to capture statistical regularities in data for the purposes of some task or application. algorithms or models are applied to capture regularities in the data in terms of the extracted features for some application. sequencing an individual’s genome will be no more complex than getting a blood test today—ushering a new era of personalized medicine. which is useful for local search and pinpointing locations on maps. which may be “superficial” and easy to extract. and some method for capturing regularities in the data. Recent work in these fields is dominated by a data-driven. relational and graph data). scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate. this book is primarily concerned with processing large amounts of text. experiments. which hosts a central repository of sequence data called EMBL-bank. systems and algorithms incapable of scaling to massive real-world datasets run the danger of being dismissed as “toy systems” with limited utility. and mine massive datasets [72]—this has been hailed as the emerging “fourth paradigm” of science [73] (complementing theory. Examples include: Is this email spam or not spam? Is this word part of an address or a location? The first task is easy to understand.g. Although large data comes in a variety of forms. empirical approach. The European Bioinformatics Institute (EBI). Scientists are predicting that. In other areas of academia. has increased storage capacity from 2. Finally. while the second task is an instance of what NLP researchers call named-entity detection [138]. There are three components to this approach: data. Large data is a fact of today’s world and data-intensive processing is fast becoming a necessity. Given the fundamental tenant in modern genetics that genotypes explain phenotypes. Data are called corpora (singular. such as the grammatical relationship between words. in the not-so-distant future. which involves ranking documents by relevance to the user’s query. which has been applied to . representations of the data. Another common application is to rank texts according to some criteria— search is a good example. corpus) by NLP researchers and collections by those from the IR community.

Around 2001. 92]. They explored several classification algorithms (the exact ones aren’t important. There is a growing body of evidence. found that more data led to better accuracy. since people make mistakes. at least in text processing. Superficial word-level features coupled with simple models in most cases trump sophisticated models over deeper features and less data. a question answering (QA) system would directly return the answer: John Wilkes Booth. consider the problem of answering short. but no matter). Furthermore. that of the three components discussed above (data. and not surprisingly. such that pronounced differences in effectiveness observed on smaller datasets basically disappeared at scale. Or better yet. INTRODUCTION everything from understanding political discourse in the blogosphere to predicting the movement of stock prices. fact-based questions such as “Who shot Abraham Lincoln?” Instead of returning a list of documents that the user would then have to sort through. you can simply search for the phrase “shot Abraham Lincoln” on the web and look for what appears to its left. As another example. Across many different algorithms. the increase in accuracy was approximately linear in the log of the size of the training data. using this task as the specific example. data probably matters the most. as part of a grammar checker). delivered somewhat tongue-in-cheek: we should just give up working on algorithms and simply spend our time gathering data (while waiting for computers to become faster so we can process the data). But why can’t we have our cake and eat it too? Why not both sophisticated models and deep features applied to lots of data? Because inference over sophisticated models and extraction of deep features are often computationally intensive. with increasing amounts of training data. features. This led to a somewhat controversial conclusion (at least at the time): machine learning algorithms really don’t matter. In 2001. Training data is fairly easy to come by—we can just gather a large corpus of texts and assume that most writers make correct choices (the training data may be noisy. when natural language processing researchers approached the challenge with sophisticated linguistic processing techniques such as syntactic and semantic analysis. Banko and Brill [14] published what has become a classic paper in natural language processing exploring the effects of training data size on classification accuracy. 53. Consider a simple task such as determining the correct usage of easily confusable words such as “than” and “then” in English. all that matters is the amount of data you have.4 CHAPTER 1. This resulted in an even more controversial recommendation. researchers discovered a far simpler approach to answering such questions based on pattern matching [27. look through multiple instances . they don’t scale well. One can view this as a supervised machine learning problem: we can train a classifier to disambiguate between the options. as we shall see). As it turns out. and then apply the classifier to new instances of the problem (say. Suppose you wanted the answer to the above question. the accuracy of different algorithms converged. algorithms). This problem gained interest in the late 1990s.

[25] described language models trained on up to two trillion words. 4 As . stupid backoff didn’t work as well as Kneser-Ney smoothing on smaller corpora. which all share the basic idea of moving probability mass from observed to unseen events in a principled manner. an aside.5 of this phrase and tally up the words that appear to the left. In 2007. the conditional probability of a word is given by the n − 1 previous words. 79]. Nevertheless.4 Their experiments compared a state-of-the-art approach known as Kneser-Ney smoothing [35] with another technique the authors affectionately referred to as “stupid backoff”. both in terms of intrinsic and application-specific metrics. where V is the number of words in the vocabulary. a simpler technique on more data beat a more sophisticated technique on less data. However.5 Not surprisingly. answers to commonly-asked questions will be stated in obvious ways. It capitalizes on the insight that in a very large text collection (i. 5 As in. Thus. To cope with this sparseness. A language model is a probability distribution that characterizes the likelihood of observing a particular sequence of words. it is interesting to observe the evolving definition of large over the years. Brants et al. such as speech recognition (to determine what the speaker is more likely to have said) and machine translation (to determine which of possible translations is the most fluent. most n-grams—in any language.e. Since there are infinitely many possible strings. even English—will never have been seen. as we will discuss in Section 6. researchers have developed a number of smoothing techniques [35. the web). That is. the probability of a sequence of words can be decomposed into the product of n-gram probabilities. which ultimately yielded better language models. This simple strategy works surprisingly well. Banko and Brill’s paper in 2001 was titled Scaling to Very Very Large Corpora for Natural Language Disambiguation. Even if we treat every word on the web as the training corpus from which to estimate the n-gram probabilities. even with lots and lots of training data! Most modern language models make the Markov assumption: in a n-gram language model. an enormous number of parameters must still be estimated from a training corpus: potentially V n parameters. by the chain rule. language modeling is a more challenging task than simply keeping track of which strings were seen how many times: some number of likely strings will never be encountered.4).. Smoothing approaches vary in effectiveness. and has become known as the redundancy-based approach to question answering. Yet another example concerns smoothing in web-scale language models [25]. They are useful in a variety of applications. estimated from a large corpus of texts. it was simpler and could be trained on more data. such that pattern-matching techniques suffice to extract answers accurately. and dealt with a corpus containing a billion words. 102. so stupid it couldn’t possibly work. and probabilities must be assigned to all of them.

INTRODUCTION Recently.6 CHAPTER 1. groups of individuals write within a shared context.7 The world. This. But the second answer is even more compelling. is messy. or even thousands of machines. which is the exact opposite of the data-driven approach. Data represent the rising tide that lifts all boats—more data lead to better algorithms and systems for solving real-world problems. So. In some ways. but unwarranted hype by the media and popular sources threatens its credibility in the long run. is the best description of itself. universal “laws of grammar”. he would say. just like human behavior in general. and that subjects and verbs must agree in number in many languages—but real-world language is affected by a multitude of other factors as well: people invent new words and phrases all the time. hundreds. in turn. cloud computing 6 This title was inspired by a classic article titled The Unreasonable Effectiveness of Mathematics in the Natural Sciences [155]. This is somewhat ironic in that the original article lauded the beauty and elegance of mathematical models in capturing natural phenomena. There are of course rules that govern the formation of words and sentences—for example. Unlike. translates into more effective algorithms and systems. etc. This is exactly what MapReduce does. . 1.1 COMPUTING IN THE CLOUDS For better or for worse. In the same way. Now that we’ve addressed the why. it is often difficult to untangled MapReduce and large-data processing from the broader discourse on cloud computing. A similar exchange appears in Chapter XI of Sylvie and Bruno Concluded by Lewis Carroll (1893). there is substantial promise in this new paradigm of computing. 7 On Exactitude in Science [23]. authors occasionally make mistakes. The Argentine writer Jorge Luis Borges wrote a famous allegorical one-paragraph story about a fictional society in which the art of cartography had gotten so advanced that their maps were as big as the lands they were describing. three Google researchers summarized this data-driven philosophy in an essay titled The Unreasonable Effectiveness of Data [65]. Let’s start with the obvious observation: dataintensive processing is beyond the capability of any individual machine and requires clusters—which means that large-data problems are fundamentally about organizing computations on dozens. let’s tackle the how. why large data? In some ways. human use of language is not constrained by succinct. True. that verbs appear before objects in English.6 Why is this so? It boils down to the fact that language in the wild. say. and the rest of this book is about the how. the more accurate a description we have of language itself. the more observations we gather about language use. the interaction of subatomic particles. in summary. the first answer is similar to the reason people climb mountains: because they’re there.

and must deal with the practicalities of a federated environment. countless papers have been written simply to attempt to define the term (e. At the most superficial level.9 Another important facet of cloud computing is what’s more precisely known as utility computing [129. grid computing tends to deal with tasks that are coarser-grained.1. a “cloud user” can dynamically provision any amount of computing resources from a “cloud provider” on demand and only pay for what is consumed.g. 9 The first example is Facebook. In this context. In fact. the idea behind utility computing is to treat computing resource as a metered service. grids are assumed to be a less tightly-coupled federation of heterogeneous resources under the control of distinct but cooperative organizations. 31. This includes social-networking services such as Facebook. Virtualization technology (e. everything that used to be called web applications has been rebranded to become “cloud applications”. So what exactly is cloud computing? This is one of those questions where ten experts will give eleven different answers. in exactly the manner as described [68]. a well-known user of Hadoop. video-sharing sites such as YouTube. of course. As a result. e. 149]. . An online email service analyzes messages and user behavior to optimize ad selection and placement. [9. In practical terms. and in truth isn’t very different from this antiquated form of computing.. which uses MapReduce to continuously improve existing algorithms and to devise new algorithms for ad selection and placement.g. As the name implies. just to name a few examples)..0” sites. verifying credentials across multiple administrative domains. they start with different assumptions. The accumulation of vast quantities of user data creates large-data problems. many of which are suitable for MapReduce. Whereas clouds are assumed to be relatively homogeneous servers that reside in a datacenter or are distributed across a relatively small number of datacenters controlled by a single organization. Google. COMPUTING IN THE CLOUDS 7 is simply brilliant marketing. in fact. 31]. To give two concrete examples: a social-networking site analyzes connections in the enormous globe-spanning graph of friendships to recommend new connections. webbased email services such as Gmail. there were grids. and user data is said to reside “in the cloud”. and applications such as Google Docs. Here we offer up our own thoughts and attempt to explain how cloud computing relates to MapReduce and data-intensive processing. The second is. and before grids. Grid computing has adopted a middleware-based approach for tackling many of these challenges. there were vector supercomputers.1.g.. the user is paying for access to virtual machine instances that run a standard operating system such as Linux. the cloud simply refers to the servers that power these sites. Under this model. [15]) is used by the cloud provider to allocate available physical resources and enforce isolation between multiple users that may be sharing the 8 8 What is the difference between cloud computing and grid computing? Although both tackle the fundamental problem of how best to bring computational resources to bear on large and difficult problems. which includes what we have previously called “Web 2. Before clouds. like electricity or natural gas. These are all largedata problems that have been tackled with MapReduce. each having claimed to be the best thing since sliced bread. 
anything running inside a browser that gathers and stores user-generated content now qualifies as an example of cloud computing. The idea harkens back to the days of time-sharing machines.

utility providers aggregate the computing demands for a large number of users. Resource consumption is measured in some equivalent of machine-hours and users are charged in increments thereof. the user has full control over the resources and can use them for arbitrary computation. provisioned resources can be released. Amazon Web Services currently leads the way and remains the dominant player. and users pay only as much as is required to solve their problems. They also gain the important property of elasticity—as demand for computing resources grow. Once one or more virtual machine instances have been provisioned. However. or over-provision and tie up precious capital in idle machines that are depreciating. which is a rebranding of what used to be called hosted services in the “pre-cloud” era. Although demand may fluctuate significantly for each user. Most systems are based on proprietary infrastructure. this may be too low level for many users. Eucalyptus [111]. thereby freeing up physical resources that can be redirected to other users. INTRODUCTION same hardware. but there is at least one. However. more resources can be seamlessly allocated from the cloud without an interruption in service. from an unpredicted spike in customers. or IaaS for short. This lowers the barrier to entry for data-intensive processing and makes MapReduce much more accessible. Virtual machines that are no longer needed are destroyed.8 CHAPTER 1. In the same way that insurance works by aggregating risk and redistributing it. A cloud provider offering customers access to virtual machine instances is said to be offering infrastructure as a service. not everyone with large-data problems can afford to purchase and maintain clusters. Both users and providers benefit in the utility computing model. overall trends in aggregate demand should be smooth and predictable. From the utility provider point of view. but what direct relevance does this have for MapReduce? The connection is quite simple: processing large amounts of data with MapReduce requires access to clusters with sufficient capacity. which is itself a new take on the age-old idea of outsourcing. that is available open source. but a number of other cloud providers populate a market that is becoming increasingly crowded. for example. coping with unexpected spikes in demand was fraught with challenges: under-provision and run the risk of service interruptions. As demand falls. this business also makes sense because large datacenters benefit from economies of scale and can be run more efficiently than smaller datacenters. Increased competition will benefit cloud users. Prior to the advent of utility computing. Users are freed from upfront capital investments necessary to build datacenters and substantial reoccurring costs in maintaining them. which allows the cloud provider to adjust capacity over time with less risk of either offering too much (resulting in inefficient use of capital) or too little (resulting in unsatisfied customers). Enter platform as a service (PaaS). A generalization of the utility computing concept is “everything as a service”. This is where utility computing comes in: clusters of sufficient size can be provisioned only when the need arises. In the world of utility computing. Platform is used generically to refer to any set of well-defined .

1. or interest in running datacenters. Other examples include outsourcing an entire organization’s email to a third party. What does this proliferation of services have to do with MapReduce? No doubt that “everything as a service” is driven by desires for greater business efficiencies. This class of services is best exemplified by Google App Engine. The latter approach of purchasing symmetric multi-processing (SMP) machines with a large number of processor sockets (dozens. or otherwise maintain basic services such as the storage layer or the programming environment. all of these ideas have been discussed in the computer science literature for some time (some for decades).2 BIG IDEAS Tackling large-data problems requires a distinct approach that sometimes runs counter to traditional models of computing. patch. The cloud allows seamless expansion of operations without the need for careful planning and supports scales that may otherwise be difficult or cost-prohibitive for an organization to achieve. 1. IaaS is an abstraction over raw physical hardware—an organization might lack the capital. BIG IDEAS 9 services on top of which users can build applications.e. the scaling “out” approach) is preferred over a small number of high-end servers (i. represents the search for an appropriate level of abstraction and beneficial divisions of labor. To be fair. For data-intensive workloads. and MapReduce is certainly not the first to adopt these ideas. since the costs of such machines do not scale linearly (i.e. expertise. At an even higher level.. the engineers at Google deserve tremendous credit for pulling these various threads together and demonstrating the power of these ideas on a scale previously unheard of. we discuss a number of “big ideas” behind MapReduce. freeing the user from having to backup. the low-end server market overlaps with the . Google maintains the infrastructure. just like MapReduce. the scaling “up” approach).e. as exemplified by Salesforce. In this section.. Scale “out”. etc. a leader in customer relationship management (CRM) software. the MapReduce programming model is a powerful abstraction that separates the what from the how of data-intensive processing. which provides the backend datastore and API for anyone to build highly-scalable web applications. which is commonplace today. Nevertheless. upgrade. a large number of commodity low-end servers (i. even hundreds) and a large amount of shared memory (hundreds or even thousands of gigabytes) is not cost effective. On the other hand.. Cloud services. cloud providers can offer software as a service (SaaS). deploy content. In the same vein. but scale and elasticity play important roles as well. a machine with twice as many processors is often significantly more than twice as expensive). The argument applies similarly to PaaS and SaaS. not “up”. and therefore pays a cloud provider to do so on its behalf.2.

non-profit organization whose mission is to establish objective database benchmarks. cooling. energy efficiency has become a key issue in building warehouse-scale computers for large-data processing. Therefore. 18]. Benchmark data submitted to that organization are probably the closest one can get to a fair “apples-to-apples” comparison of cost and performance for specific. For data-intensive applications. Barroso and H¨lzle’s recent treatise of what they dubbed “warehouse-scale como puters” [18] contains a thoughtful analysis of the two approaches. and conclude that a cluster of low-end servers approaches the performance of the equivalent cluster of high-end servers—the small performance gap is insufficient to justify the price premium of the high-end servers. cooling fans. The third component captures how much of the power delivered . As a result. What if we take into account the fact that communication between nodes in a high-end SMP machine is orders of magnitude faster than communication between nodes in a commodity network-based cluster? Since workloads today are beyond the capability of any single machine (no matter how powerful). and therefore most existing implementations of the MapReduce programming model are designed around clusters of low-end commodity servers. cooling. Excluding storage costs. The first component measures how much of a building’s incoming power is actually delivered to computing equipment.g.10 CHAPTER 1.g. Datacenter efficiency is typically factored into three separate components that can be independently measured and optimized [18]. power distribution inefficiencies). only one component of the total cost of delivering computing capacity. The second component measures how much of a server’s incoming power is lost to the power supply. interchangeable components. which has the effect of keeping prices low due to competition. and correspondingly. of course. the conclusion appears to be clear: scaling “out” is superior to scaling “up”. the comparison is more accurately between a smaller cluster of high-end machines and a larger cluster of low-end machines (network communication is unavoidable in both cases). The Transaction Processing Council (TPC) is a neutral. etc. Barroso and H¨lzle o model these two approaches under workloads that demand more or less communication. and economies of scale. Based on TPC-C benchmark results from late 2007. [67. INTRODUCTION high-volume desktop computing market. Capital costs in acquiring servers is. air handling) and electrical infrastructure (e. etc. how much is lost to the building’s mechanical systems (e.. well-defined relational processing applications. a low-end server platform is about four times more cost efficient than a high-end shared memory platform from the same vendor. it is important to factor in operational costs when deploying a scale-out solution based on large numbers of commodity servers. Operational costs are dominated by the cost of electricity to power the servers as well as other aspects of datacenter operations that are functionally related to power: power distribution.. the price/performance advantage of the low-end server increases to about a factor of twelve.

connectivity loss. As a result. However. A well-designed. power failure. A survey of five thousand Google servers over a six-month period shows that servers operate most of the time at between 10% and 50% utilization [17].2. which is an energyinefficient operating region. however. Even with these reliable servers. but commonplace.) is actually used to perform useful computations. evidence suggests that scaling out remains more attractive than scaling up..g. but yet retain the ability to power up (nearly) instantaneously in response to demand. who provide detailed cost models for typical modern datacenters. a server may fail at any time. Even then. in large clusters disk failures are common [123] and RAM experiences more errors than one might expect [135].g. even factoring in operational costs. where energy consumption would be proportional to load. and overall performance should gracefully degrade as server failures pile up. etc. Adoption of industry best-practices can help datacenter operators achieve state-of-the-art efficiency. For the sake of argument. a server at 10% utilization may draw slightly more than half as much power as a server at 100% utilization (which means that a lightly-loaded server is much less efficient than a heavily-loaded server). datacenter efficiency is a topic that is beyond the scope of this book. a broken server that has been repaired . That is.000-server cluster would still experience roughly 10 failures a day. Assume failures are common. etc. Barroso and H¨lzle have advocated for research o and development in energy-proportional machines. without notice. One important issue that has been identified is the non-linearity between load and power draw. This means that any large-scale service that is distributed across a large cluster (either a user-facing application or a computing platform like MapReduce) must cope with hardware failures as an intrinsic aspect of its operation [66].000-server cluster would still experience one failure daily. a 10. let us suppose that a MTBF of 10.000 days (about thirty years) were achievable at realistic costs (which is unlikely). A simple calculation suffices to demonstrate: let us suppose that a cluster is built from reliable machines with a mean-time between failures (MTBF) of 1000 days (about three years). other cluster nodes should seamlessly step in to handle the load. consult Barroso and H¨lzle [18] o and Hamilton [67]. At warehouse scale. For example. Of the three components of datacenter efficiency.1.. disk. Just as important. Although we have provided a brief overview here. The third component. the first two are relatively straightforward to objectively quantify. fault-tolerant service must cope with failures up to a point without impacting the quality of service—failures should not result in inconsistencies or indeterminism from the user perspective. a 10. is much more difficult to measure. Datacenters suffer from both planned outages (e. That is.). As servers go down. RAM. failures are not only inevitable. BIG IDEAS 11 to computing components (processor. system maintenance and hardware upgrades) and unexpected outages (e. such that an idle processor would (ideally) consume no power. For more details.

Data-intensive processing by definition means that the relevant datasets are too large to fit in memory and must be held on disk. although solid-state disks have substantially faster seek times. First. at least in the near future. Many data-intensive workloads are not very processor-demanding. Move processing to the data. MapReduce is primarily designed for batch processing over large datasets. Given reasonable assumptions about disk latency and throughput. Second. . A simple scenario10 poignantly illustrates the large performance gap between sequential operations and random seeks: assume a 1 terabyte database containing 1010 100-byte records. To the extent possible. it is more efficient to move the processing around. all computations are organized into long streaming operations that 10 Adapted 11 For from a post by Ted Dunning on the Hadoop mailing list. a back-of-the-envelop calculation will show that updating 1% of the records (by accessing and then mutating each record) will take about a month on a single machine. which means that the separation of compute and storage creates a bottleneck in the network.g. As a result.11 The development of solid-state drives is unlikely the change this balance for at least two reasons. we can take advantage of data locality by running code on the processor directly attached to the block of data we need. order-of-magnitude differences in performance between sequential and random access still remain. for climate or nuclear simulations). INTRODUCTION should be able to seamlessly rejoin the service without manual reconfiguration by the administrator. Seek times for random disk access are fundamentally limited by the mechanical nature of the devices: read heads can only move so fast and platters can only spin so rapidly. The distributed file system is responsible for managing the data over which MapReduce operates. orders of magnitude faster than random data access.12 CHAPTER 1. As an alternative to moving data around.. literally. Sequential data access is. and instead organize computations so that data is processed sequentially. That is. In traditional high-performance computing (HPC) applications (e. the process would finish in under a work day on a single machine. MapReduce assumes an architecture where processors and storage (disk) are co-located. Process data sequentially and avoid random access. it is desirable to avoid random data access. the cost differential between traditional magnetic disks and solid-state disks remains substantial: large-data will for the most part remain on mechanical drives. Jacobs [76] provides real-world benchmarks in his discussion of large-data problems. it is commonplace for a supercomputer to have “processing nodes” and “storage nodes” linked together by a high-capacity interconnect. On the other hand. In such a setup. Mature implementations of the MapReduce programming model are able to robustly cope with failures through a number of mechanisms such as automatic task restarts on different cluster nodes. more detail. if one simply reads the entire database and rewrites all the records (mutating those that need updating).

g. code is guaranteed to behave as expected. but the truth remains: concurrent programs are notoriously difficult to reason about and even harder to debug. The advantage is that the execution framework only needs to be designed once and verified for correctness—thereafter.. quiet office. one of the key reasons why writing code is difficult is because the programmer must simultaneously keep track of many details in short term memory—ranging from the mundane (e. MapReduce maintains a separation of what computations are to be performed and how those computations are actually carried out on a cluster of machines.g. the biggest headache in distributed programming is that code runs concurrently in unpredictable orders. comfortable furniture. Of course. Programmers are taught to use low-level devices such as mutexes and to apply high-level “design patterns” such as producer–consumer queues to tackle these challenges. BIG IDEAS 13 take advantage of the aggregate bandwidth of many disks in a cluster. As an aspiration. This imposes a high cognitive load and requires intense concentration. We can define scalability along at least two dimensions..1. The programming model specifies simple and well-defined interfaces between a small number of components.g. The first is under the control of the programmer.. accessing data in unpredictable patterns. which leads to a number of recommendations about a programmer’s environment (e. . in terms of resources: given a cluster twice 12 See also DeWitt and Gray [50] for slightly different definitions in terms of speedup and scaleup.12 First. Many aspects of MapReduce’s design explicitly trade latency for throughput. large monitors. processes. According to many guides on the practice of software engineering written by experienced industry professionals.). no more debugging race conditions and addressing lock contention) and can instead focus on algorithm or application design.g. or machines. all else being equal. the same algorithm should take at most twice as long to run. and other well-known problems. The upshot is that the developer is freed from having to worry about system-level details (e. Seamless scalability. and therefore is easy for the programmer to reason about. as long as the developer expresses computations in the programming model. let us sketch the behavior of an ideal algorithm.. variable names) to the sophisticated (e. etc. For data-intensive processing. in terms of data: given twice the amount of data. it goes without saying that scalable algorithms are highly desirable. while the second is exclusively the responsibility of the execution framework or “runtime”. a corner case of an algorithm that requires special treatment).g.).2. Second. data starvation issues in the processing pipeline. MapReduce addresses the challenges of distributed programming by providing an abstraction that isolates the developer from system-level details (e. deadlocks.. etc. Hide system-level details from the application developer. This gives rise to race conditions. The challenges in writing distributed software are greatly compounded—the programmer must manage details across several threads. locking of data structures.

an ideal algorithm would maintain these desirable scaling characteristics across a wide range of settings: on data ranging from gigabytes to petabytes. Furthermore. the same algorithm should take no more than half as long to run. greater efficiencies gained by parallelization are entirely offset by increased communication requirements. the MapReduce programming model is simple enough that it is actually possible. In the domain of text processing. But what happens when the amount of data doubles in the near future. Other than for embarrassingly parallel problems. to approach the ideal scaling characteristics discussed above. in many circumstances. Furthermore. the ideal algorithm would exhibit these desired behaviors without requiring any modifications whatsoever. The algorithm designer is faced with diminishing returns. and is often illustrated with a cute quote: “nine women cannot have a baby in one month”. Recall that the programming model maintains a clear separation between what computations need to occur with how those computations are actually orchestrated on a cluster. not even tuning of parameters. unobtainable. a MapReduce algorithm remains fixed. One of the fundamental assertions in Fred Brook’s classic The Mythical Man-Month [28] is that adding programmers to a project behind schedule will only make it fall further behind. these fundamental limitations shouldn’t prevent us from at least striving for the unobtainable. the scaling “up” vs. As a result. and then doubles again shortly thereafter? Simply buying more memory is not a viable solution. then we should be able to finish in an hour on a cluster . this is a fair assumption. most algorithms today assume that data fits in memory on a single machine. and it is the responsibility of the execution framework to execute the algorithm. We introduce the idea of the “tradeable machine hour”. of course. and beyond a certain point. Nevertheless. scaling “out” argument). the same is also true of algorithms: increasing the degree of parallelization also increases communication costs.14 CHAPTER 1. If running an algorithm on a particular dataset takes 100 machine hours. For the most part. The truth is that most current algorithms are far from the ideal. algorithms with the characteristics sketched above are. as a play on Brook’s classic title. on clusters consisting of a few to a few thousand machines. the price of a machine does not scale linearly with the amount of available memory beyond a certain point (once again. INTRODUCTION the size. This is because complex tasks cannot be chopped into smaller pieces and allocated in a linear fashion. Quite simply. Amazingly. Finally. Perhaps the most exciting aspect of MapReduce is that it represents a small step toward algorithms that behave in the ideal manner discussed above. Although Brook’s observations are primarily about software engineers and the software development process. as the amount of data is growing faster than the price of memory is falling. for example. algorithms that require holding intermediate data in memory on a single machine will simply break on sufficiently-large datasets—moving from a single machine to a cluster architecture requires fundamentally different algorithms (and reimplementations).

1.3 WHY IS THIS DIFFERENT?

"Due to the rapidly decreasing cost of processing, memory, and communication, it has appeared inevitable for at least two decades that parallel machines will eventually displace sequential ones in computationally intensive domains." — Leslie Valiant [148]14

For several decades, computer scientists have predicted that the dawn of the age of parallel computing was "right around the corner" and that sequential processing would soon fade into obsolescence (consider, for example, the above quote). Yet, until very recently, they have been wrong. The relentless progress of Moore's Law for several decades has ensured that most of the world's problems could be solved by single-processor machines, save the needs of a few (scientists simulating molecular interactions or nuclear reactions, for example). Couple that with the inherent challenges of concurrency, and the result has been that parallel processing and distributed systems have largely been confined to a small segment of the market and esoteric upper-level electives in the computer science curriculum.

However, all of that changed around the middle of the first decade of this century. The manner in which the semiconductor industry had been exploiting Moore's Law simply ran out of opportunities for improvement: faster clocks, deeper pipelines, superscalar architectures, and other tricks of the trade reached a point of diminishing returns that did not justify continued investment. This marked the beginning of an entirely new strategy and the dawn of the multi-core era [115]. Unfortunately, this radical shift in hardware architecture was not matched at that time by corresponding advances in how software could be easily designed for these new processors (but not for lack of trying [104]). Nevertheless, parallel processing became an important issue at the forefront of everyone's mind—it represented the only way forward.

At around the same time, we witnessed the growth of large-data problems. In the late 1990s and even during the beginning of the first decade of this century, relatively few organizations had data-intensive processing needs that required large clusters: a handful of internet companies and perhaps a few dozen large corporations. But then, everything changed. Through a combination of many different factors (falling prices of disks, rise of user-generated web content, etc.), large-data problems began popping up everywhere. Data-intensive processing needs became widespread, which drove innovations in distributed computing such as MapReduce—first by Google, and then by Yahoo and the open source community.

14 Guess when this was written? You may be surprised.

This in turn created more demand: when organizations learned about the availability of effective data analysis tools for large datasets, they began instrumenting various business processes to gather even more data—driven by the belief that more data leads to deeper insights and greater competitive advantages. Today, not only are large-data problems ubiquitous, but technological solutions for addressing them are widely accessible. Anyone can download the open source Hadoop implementation of MapReduce, pay a modest fee to rent a cluster from a utility cloud provider, and be happily processing terabytes upon terabytes of data within the week. Finally, the computer scientists are right—the age of parallel computing has begun, both in terms of multiple cores in a chip and multiple machines in a cluster (each of which often has multiple cores).

Why is MapReduce important? In practical terms, it provides a very effective tool for tackling large-data problems. But beyond that, MapReduce is important in how it has changed the way we organize computations at a massive scale. MapReduce represents the first widely-adopted step away from the von Neumann model that has served as the foundation of computer science over the last half plus century. Valiant called this a bridging model [148], a conceptual bridge between the physical implementation of a machine and the software that is to be executed on that machine. Until recently, the von Neumann model has served us well: Hardware designers focused on efficient implementations of the von Neumann model and didn't have to think much about the actual software that would run on the machines. Similarly, the software industry developed software targeted at the model without worrying about the hardware details. The result was extraordinary growth: chip designers churned out successive generations of increasingly powerful processors, and software engineers were able to develop applications in high-level languages that exploited those processors.

Today, however, the von Neumann model isn't sufficient anymore: we can't treat a multi-core processor or a large cluster as an agglomeration of many von Neumann machine instances communicating over some interconnect. Such a view places too much burden on the software developer to effectively take advantage of available computational resources—it simply is the wrong level of abstraction. MapReduce can be viewed as the first breakthrough in the quest for new abstractions that allow us to organize computations, not over individual machines, but over entire clusters. As Barroso puts it, the datacenter is the computer [18, 119].

To be fair, MapReduce is certainly not the first model of parallel computation that has been proposed. The most prevalent model in theoretical computer science, which dates back several decades, is the PRAM [77, 60].15 In the model, an arbitrary number of processors, sharing an unboundedly large memory, operate synchronously on a shared input to produce some output. Other models include LogP [43] and BSP [148].

15 More than a theoretical model, the PRAM has been recently prototyped in hardware [153].

For reasons that are beyond the scope of this book, none of these previous models have enjoyed the success that MapReduce has in terms of adoption and in terms of impact on the daily lives of millions of users.16 MapReduce is the most successful abstraction over large-scale computational resources we have seen to date. However, as anyone who has taken an introductory computer science course knows, abstractions manage complexity by hiding details and presenting well-defined behaviors to users of those abstractions. They, inevitably, are imperfect—making certain tasks easier but others more difficult, and sometimes, impossible (in the case where the detail suppressed by the abstraction is exactly what the user cares about). This critique applies to MapReduce: it makes certain large-data problems easier, but suffers from limitations as well. This means that MapReduce is not the final word, but rather the first in a new class of programming models that will allow us to more effectively organize computations at a massive scale.

So if MapReduce is only the beginning, what's next beyond MapReduce? We're getting ahead of ourselves, as we can't meaningfully answer this question before thoroughly understanding what MapReduce can and cannot do well. This is exactly the purpose of this book: let us now begin our exploration.

16 Nevertheless, it is important to understand the relationship between MapReduce and existing models so that we can bring to bear accumulated knowledge about parallel algorithms; for example, Karloff et al. [82] demonstrated that a large class of PRAM algorithms can be efficiently simulated via MapReduce.

1.4 WHAT THIS BOOK IS NOT

Actually, not quite yet. A final word before we get started. This book is about MapReduce algorithm design, particularly for text processing (and related) applications. Although our presentation most closely follows the Hadoop open-source implementation of MapReduce, this book is explicitly not about Hadoop programming. We don't, for example, discuss APIs, command-line invocations for running jobs, etc. For those aspects, we refer the reader to Tom White's excellent book, "Hadoop: The Definitive Guide", published by O'Reilly [154].

CHAPTER 2

MapReduce Basics

The only feasible approach to tackling large-data problems today is to divide and conquer, a fundamental concept in computer science that is introduced very early in typical undergraduate curricula. The basic idea is to partition a large problem into smaller subproblems. To the extent that the sub-problems are independent [5], they can be tackled in parallel by different workers—threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster. Intermediate results from each individual worker are then combined to yield the final output.1

The general principles behind divide-and-conquer algorithms are broadly applicable to a wide range of problems in many different application domains. However, the details of their implementations are varied and complex. For example, the following are just some of the issues that need to be addressed:

• How do we break up a large problem into smaller tasks? More specifically, how do we decompose the problem so that the smaller tasks can be executed in parallel?

• How do we assign tasks to workers distributed across a potentially large number of machines (while keeping in mind that some workers are better suited to running some tasks than others, e.g., due to available resources, locality constraints, etc.)?

• How do we ensure that the workers get the data they need?

• How do we coordinate synchronization among the different workers?

• How do we share partial results from one worker that is needed by another?

• How do we accomplish all of the above in the face of software errors and hardware faults?

In traditional parallel or distributed programming environments, the developer needs to explicitly address many (and sometimes, all) of the above issues. In shared memory programming, the developer needs to explicitly coordinate access to shared data structures through synchronization primitives such as mutexes, to explicitly handle process synchronization through devices such as barriers, and to remain ever vigilant for common problems such as deadlocks and race conditions.

1 We note that promising technologies such as quantum or biological computing could potentially induce a paradigm shift, but they are far from being sufficiently mature to solve real world problems.

Language extensions, like OpenMP for shared memory parallelism,2 or libraries implementing the Message Passing Interface (MPI) for cluster-level parallelism,3 provide logical abstractions that hide details of operating system synchronization and communications primitives. However, even with these extensions, developers are still burdened to keep track of how resources are made available to workers. Additionally, these frameworks are mostly designed to tackle processor-intensive problems and have only rudimentary support for dealing with very large amounts of input data. When using existing parallel computing approaches for large-data computation, the programmer must devote a significant amount of attention to low-level system details, which detracts from higher-level problem solving.

One of the most significant advantages of MapReduce is that it provides an abstraction that hides many system-level details from the programmer. Therefore, a developer can focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes that depend on them. Like OpenMP and MPI, MapReduce provides a means to distribute computation without burdening the programmer with the details of distributed computing (but at a different level of granularity). However, organizing and coordinating large amounts of computation is only part of the challenge. Large-data processing by definition requires bringing data and code together for computation to occur—no small feat for datasets that are terabytes and perhaps petabytes in size! MapReduce addresses this challenge by providing a simple abstraction for the developer, transparently handling most of the details behind the scenes in a scalable, robust, and efficient manner. As we mentioned in Chapter 1, instead of moving large amounts of data around, it is far more efficient, if possible, to move the code to the data. This is operationally realized by spreading data across the local disks of nodes in a cluster and running processes on nodes that hold the data. The complex task of managing storage in such a processing environment is typically handled by a distributed file system that sits underneath MapReduce.

This chapter introduces the MapReduce programming model and the underlying distributed file system. We start in Section 2.1 with an overview of functional programming, from which MapReduce draws its inspiration. Section 2.2 introduces the basic programming model, focusing on mappers and reducers. Section 2.3 discusses the role of the execution framework in actually running MapReduce programs (called jobs). Section 2.4 fills in additional details by introducing partitioners and combiners, which provide greater control over data flow. MapReduce would not be practical without a tightly-integrated distributed file system that manages the data being processed; Section 2.5 covers this in detail. Tying everything together, a complete cluster architecture is described in Section 2.6 before the chapter ends with a summary.

2 http://www.openmp.org/
3 http://www.mcs.anl.gov/mpi/


Figure 2.1: Illustration of map and fold, two higher-order functions commonly used together in functional programming: map takes a function f and applies it to every element in a list, while fold iteratively applies a function g to aggregate results.

2.1 FUNCTIONAL PROGRAMMING ROOTS

MapReduce has its roots in functional programming, which is exemplified in languages such as Lisp and ML.4 A key feature of functional languages is the concept of higher-order functions, or functions that can accept other functions as arguments. Two common built-in higher-order functions are map and fold, illustrated in Figure 2.1. Given a list, map takes as an argument a function f (that takes a single argument) and applies it to all elements in a list (the top part of the diagram). Given a list, fold takes as arguments a function g (that takes two arguments) and an initial value: g is first applied to the initial value and the first item in the list, the result of which is stored in an intermediate variable. This intermediate variable and the next item in the list serve as the arguments to a second application of g, the results of which are stored in the intermediate variable. This process repeats until all items in the list have been consumed; fold then returns the final value of the intermediate variable. Typically, map and fold are used in combination. For example, to compute the sum of squares of a list of integers, one could map a function that squares its argument (i.e., λx.x²) over the input list, and then fold the resulting list with the addition function (more precisely, λxλy.x + y) using an initial value of zero. We can view map as a concise way to represent the transformation of a dataset (as defined by the function f). In the same vein, we can view fold as an aggregation operation, as defined by the function g. One immediate observation is that the application of f to each item in a list (or more generally, to elements in a large dataset)
4 However, there are important characteristics of MapReduce that make it non-functional in nature—this will become apparent later.


can be parallelized in a straightforward manner, since each functional application happens in isolation. In a cluster, these operations can be distributed across many different machines. The fold operation, on the other hand, has more restrictions on data locality—elements in the list must be “brought together” before the function g can be applied. However, many real-world applications do not require g to be applied to all elements of the list. To the extent that elements in the list can be divided into groups, the fold aggregations can also proceed in parallel. Furthermore, for operations that are commutative and associative, significant efficiencies can be gained in the fold operation through local aggregation and appropriate reordering. In a nutshell, we have described MapReduce. The map phase in MapReduce roughly corresponds to the map operation in functional programming, whereas the reduce phase in MapReduce roughly corresponds to the fold operation in functional programming. As we will discuss in detail shortly, the MapReduce execution framework coordinates the map and reduce phases of processing over large amounts of data on large clusters of commodity machines. Viewed from a slightly different angle, MapReduce codifies a generic “recipe” for processing large datasets that consists of two stages. In the first stage, a user-specified computation is applied over all input records in a dataset. These operations occur in parallel and yield intermediate output that is then aggregated by another user-specified computation. The programmer defines these two types of computations, and the execution framework coordinates the actual processing (very loosely, MapReduce provides a functional abstraction). Although such a two-stage processing structure may appear to be very restrictive, many interesting algorithms can be expressed quite concisely— especially if one decomposes complex algorithms into a sequence of MapReduce jobs. Subsequent chapters in this book focus on how a number of algorithms can be implemented in MapReduce. To be precise, MapReduce can refer to three distinct but related concepts. First, MapReduce is a programming model, which is the sense discussed above. Second, MapReduce can refer to the execution framework (i.e., the “runtime”) that coordinates the execution of programs written in this particular style. Finally, MapReduce can refer to the software implementation of the programming model and the execution framework: for example, Google’s proprietary implementation vs. the open-source Hadoop implementation in Java. And in fact, there are many implementations of MapReduce, e.g., targeted specifically for multi-core processors [127], for GPGPUs [71], for the CELL architecture [126], etc. There are some differences between the MapReduce programming model implemented in Hadoop and Google’s proprietary implementation, which we will explicitly discuss throughout the book. However, we take a rather Hadoop-centric view of MapReduce, since Hadoop remains the most mature and accessible implementation to date, and therefore the one most developers are likely to use.
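To make the map and fold operations described above concrete, here is a small illustrative sketch in Python (our choice of language for illustration; the book itself works with pseudo-code and Hadoop). It computes the sum of squares exactly as in the example: map applies the squaring function to every element independently, and fold (Python's functools.reduce) aggregates the results with addition, starting from an initial value of zero.

```python
from functools import reduce  # Python's built-in fold over a sequence

xs = [1, 2, 3, 4]

# map: apply f (squaring) to every element; each application is independent
squares = list(map(lambda x: x * x, xs))

# fold: repeatedly apply g (addition) to an intermediate value and the next item,
# starting from an initial value of zero
sum_of_squares = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)         # [1, 4, 9, 16]
print(sum_of_squares)  # 30
```

Because each application of f is independent, the map step could be spread across many machines; the fold step requires bringing results together, which is exactly the locality restriction discussed above.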


2.2 MAPPERS AND REDUCERS

Key-value pairs form the basic data structure in MapReduce. Keys and values may be primitives such as integers, floating point values, strings, and raw bytes, or they may be arbitrarily complex structures (lists, tuples, associative arrays, etc.). Programmers typically need to define their own custom data types, although a number of libraries such as Protocol Buffers,5 Thrift,6 and Avro7 simplify the task. Part of the design of MapReduce algorithms involves imposing the key-value structure on arbitrary datasets. For a collection of web pages, keys may be URLs and values may be the actual HTML content. For a graph, keys may represent node ids and values may contain the adjacency lists of those nodes (see Chapter 5 for more details). In some algorithms, input keys are not particularly meaningful and are simply ignored during processing, while in other cases input keys are used to uniquely identify a datum (such as a record id). In Chapter 3, we discuss the role of complex keys and values in the design of various algorithms. In MapReduce, the programmer defines a mapper and a reducer with the following signatures:

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]

The convention [. . .] is used throughout this book to denote a list. The input to a MapReduce job starts as data stored on the underlying distributed file system (see Section 2.5). The mapper is applied to every input key-value pair (split across an arbitrary number of files) to generate an arbitrary number of intermediate key-value pairs. The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs.8 Implicit between the map and reduce phases is a distributed “group by” operation on intermediate keys. Intermediate data arrive at each reducer in order, sorted by the key. However, no ordering relationship is guaranteed for keys across different reducers. Output key-value pairs from each reducer are written persistently back onto the distributed file system (whereas intermediate key-value pairs are transient and not preserved). The output ends up in r files on the distributed file system, where r is the number of reducers. For the most part, there is no need to consolidate reducer output, since the r files often serve as input to yet another MapReduce job. Figure 2.2 illustrates this two-stage processing structure. A simple word count algorithm in MapReduce is shown in Figure 2.3. This algorithm counts the number of occurrences of every word in a text collection, which may be the first step in, for example, building a unigram language model (i.e., probability
5 http://code.google.com/p/protobuf/
6 http://incubator.apache.org/thrift/
7 http://hadoop.apache.org/avro/
8 This characterization, while conceptually accurate, is a slight simplification. See Section 2.6 for more details.

Figure 2.2: Simplified view of MapReduce. Mappers are applied to all input key-value pairs, which generate an arbitrary number of intermediate key-value pairs. Reducers are applied to all values associated with the same key. Between the map and reduce phases lies a barrier that involves a large distributed sort and group by.

class Mapper
    method Map(docid a, doc d)
        for all term t ∈ doc d do
            Emit(term t, count 1)

class Reducer
    method Reduce(term t, counts [c1, c2, . . .])
        sum ← 0
        for all count c ∈ counts [c1, c2, . . .] do
            sum ← sum + c
        Emit(term t, count sum)

Figure 2.3: Pseudo-code for the word count algorithm in MapReduce. The mapper emits an intermediate key-value pair for each word in a document. The reducer sums up all counts for each word.

distribution over words in a collection). Input key-value pairs take the form of (docid, doc) pairs stored on the distributed file system, where the former is a unique identifier for the document, and the latter is the text of the document itself. The mapper takes an input key-value pair, tokenizes the document, and emits an intermediate key-value pair for every word: the word itself serves as the key, and the integer one serves as the value (denoting that we've seen the word once). The MapReduce execution framework guarantees that all values associated with the same key are brought together in the reducer. Therefore, in our word count algorithm, we simply need to sum up all counts (ones) associated with each word. The reducer does exactly this, and emits final key-value pairs with the word as the key, and the count as the value. Final output is written to the distributed file system, one file per reducer. Words within each file will be sorted by alphabetical order, and each file will contain roughly the same number of words. The partitioner, which we discuss later in Section 2.4, controls the assignment of words to reducers. The output can be examined by the programmer or used as input to another MapReduce program.

To provide a bit more implementation detail: pseudo-code provided in this book roughly mirrors how MapReduce programs are written in Hadoop. Mappers and reducers are objects that implement the Map and Reduce methods, respectively. In Hadoop, a mapper object is initialized for each map task (associated with a particular sequence of key-value pairs called an input split) and the Map method is called on each key-value pair by the execution framework. In configuring a MapReduce job, the programmer provides a hint on the number of map tasks to run, but the execution framework (see next section) makes the final determination based on the physical layout of the data (more details in Section 2.5 and Section 2.6). The situation is similar for the reduce phase: a reducer object is initialized for each reduce task, and the Reduce method is called once per intermediate key. In contrast with the number of map tasks, the programmer can precisely specify the number of reduce tasks.

There are some differences between the Hadoop implementation of MapReduce and Google's implementation.9 In Hadoop, the reducer is presented with a key and an iterator over all values associated with the particular key. The values are arbitrarily ordered. Google's implementation allows the programmer to specify a secondary sort key for ordering the values (if desired)—in which case values associated with each key would be presented to the developer's reduce code in sorted order. Later in Section 3.4 we discuss how to overcome this limitation in Hadoop to perform secondary sorting. Another difference: in Google's implementation the programmer is not allowed to change the key in the reducer. That is, the reducer output key must be exactly the same as the reducer input key. In Hadoop, there is no such restriction, and the reducer can emit an arbitrary number of output key-value pairs (with different keys). We will return to discuss the details of Hadoop job execution in Section 2.6, which is dependent on an understanding of the distributed file system (covered in Section 2.5).

9 Personal communication, Jeff Dean.
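As a rough illustration of the processing flow just described, the following Python sketch simulates the word count job of Figure 2.3 entirely in memory: a map phase over (docid, doc) pairs, a group-by-key step standing in for the distributed shuffle and sort, and a reduce phase over sorted keys. The function names and the single-process "runner" are our own simplification, not part of Hadoop or any MapReduce API.

```python
from collections import defaultdict

def map_wordcount(docid, doc):
    # Emit an intermediate (term, 1) pair for every token in the document
    for term in doc.split():
        yield term, 1

def reduce_wordcount(term, counts):
    # Sum all partial counts associated with the same term
    yield term, sum(counts)

def run_job(documents):
    # Map phase: apply the mapper to every input key-value pair
    intermediate = defaultdict(list)
    for docid, doc in documents.items():
        for key, value in map_wordcount(docid, doc):
            intermediate[key].append(value)   # "shuffle": group values by key
    # Reduce phase: keys are presented in sorted order, as in MapReduce
    output = {}
    for key in sorted(intermediate):
        for out_key, out_value in reduce_wordcount(key, intermediate[key]):
            output[out_key] = out_value
    return output

print(run_job({"d1": "a b b", "d2": "b c"}))  # {'a': 1, 'b': 3, 'c': 1}
```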

To reiterate: although the presentation of algorithms in this book closely mirrors the way they would be implemented in Hadoop, our focus is on algorithm design and conceptual understanding—not actual Hadoop programming. For that, we would recommend Tom White's book [154].

What are the restrictions on mappers and reducers? Mappers and reducers can express arbitrary computations over their inputs. However, one must generally be careful about use of external resources since multiple mappers or reducers may be contending for those resources. For example, it may be unwise for a mapper to query an external SQL database, since that would introduce a scalability bottleneck on the number of map tasks that could be run in parallel (since they might all be simultaneously querying the database).10 In general, mappers can emit an arbitrary number of intermediate key-value pairs, and they need not be of the same type as the input key-value pairs. Similarly, reducers can emit an arbitrary number of final key-value pairs, and they can differ in type from the intermediate key-value pairs. Although not permitted in functional programming, mappers and reducers can have side effects. This is a powerful and useful feature: for example, preserving state across multiple inputs is central to the design of many MapReduce algorithms (see Chapter 3). Such algorithms can be understood as having side effects that only change state that is internal to the mapper or reducer. While the correctness of such algorithms may be more difficult to guarantee (since the function's behavior depends not only on the current input but on previous inputs), most potential synchronization problems are avoided since internal state is private only to individual mappers and reducers. In other cases (see Section 4.4 and Section 6.5), it may be useful for mappers or reducers to have external side effects, such as writing files to the distributed file system. Since many mappers and reducers are run in parallel, and the distributed file system is a shared global resource, special care must be taken to ensure that such operations avoid synchronization conflicts. One strategy is to write a temporary file that is renamed upon successful completion of the mapper or reducer [45].

In addition to the "canonical" MapReduce processing flow, other variations are also possible. MapReduce programs can contain no reducers, in which case mapper output is directly written to disk (one file per mapper). For embarrassingly parallel problems, e.g., parse a large text collection or independently analyze a large number of images, this would be a common pattern. The converse—a MapReduce program with no mappers—is not possible, although in some cases it is useful for the mapper to implement the identity function and simply pass input key-value pairs to the reducers. This has the effect of sorting and regrouping the input for reduce-side processing. Similarly, in some cases it is useful for the reducer to implement the identity function, in which case the program simply sorts and groups mapper output. Finally, running identity mappers and reducers has the effect of regrouping and resorting the input data (which is sometimes useful).

10 Unless, of course, the database itself is highly scalable.
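The identity mapper and reducer mentioned above are trivial to write; the value of running them comes entirely from the distributed sort and group-by that the framework performs in between. A minimal sketch in Python (generator-style emit, purely for illustration):

```python
def identity_mapper(key, value):
    # Pass every input key-value pair through unchanged
    yield key, value

def identity_reducer(key, values):
    # Emit each value associated with the key, unchanged
    for value in values:
        yield key, value
```

Fed through the execution framework, these two functions do no computation of their own, yet the job's output is the input regrouped and sorted by key, which is the useful side effect described in the text.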

Although in the most common case, input to a MapReduce job comes from data stored on the distributed file system and output is written back to the distributed file system, any other system that satisfies the proper abstractions can serve as a data source or sink. With Google's MapReduce implementation, BigTable [34], a sparse, distributed, persistent multidimensional sorted map, is frequently used as a source of input and as a store of MapReduce output. HBase is an open-source BigTable clone and has similar capabilities. Also, Hadoop has been integrated with existing MPP (massively parallel processing) relational databases, which allows a programmer to write MapReduce jobs over database rows and dump output into a new database table. Finally, in some cases MapReduce jobs may not consume any input at all (e.g., computing π) or may only consume a small amount of data (e.g., input parameters to many instances of processor-intensive simulations running in parallel).

2.3 THE EXECUTION FRAMEWORK

One of the most important ideas behind MapReduce is separating the what of distributed processing from the how. A MapReduce program, referred to as a job, consists of code for mappers and reducers (as well as combiners and partitioners to be discussed in the next section) packaged together with configuration parameters (such as where the input lies and where the output should be stored). The developer submits the job to the submission node of a cluster (in Hadoop, this is called the jobtracker) and the execution framework (sometimes called the "runtime") takes care of everything else: it transparently handles all other aspects of distributed code execution, on clusters ranging from a single node to a few thousand nodes. Specific responsibilities include:

Scheduling. Each MapReduce job is divided into smaller units called tasks (see Section 2.6 for more details). For example, a map task may be responsible for processing a certain block of input key-value pairs (called an input split in Hadoop); similarly, a reduce task may handle a portion of the intermediate key space. It is not uncommon for MapReduce jobs to have thousands of individual tasks that need to be assigned to nodes in the cluster. In large jobs, the total number of tasks may exceed the number of tasks that can be run on the cluster concurrently, making it necessary for the scheduler to maintain some sort of a task queue and to track the progress of running tasks so that waiting tasks can be assigned to nodes as they become available. Another aspect of scheduling involves coordination among tasks belonging to different jobs (e.g., from different users). How can a large, shared resource support several users simultaneously in a predictable, transparent, policy-driven fashion? There has been some recent work along these lines in the context of Hadoop [131, 160].

Speculative execution is an optimization that is implemented by both Hadoop and Google's MapReduce implementation (called "backup tasks" [45]). Due to the barrier between the map and reduce tasks, the map phase of a job is only as fast as the slowest map task. Similarly, the completion time of a job is bounded by the running time of the slowest reduce task. As a result, the speed of a MapReduce job is sensitive to what are known as stragglers, or tasks that take an unusually long time to complete. One cause of stragglers is flaky hardware: for example, a machine that is suffering from recoverable errors may become significantly slower. With speculative execution, an identical copy of the same task is executed on a different machine, and the framework simply uses the result of the first task attempt to finish. Zaharia et al. [161] presented different execution strategies in a recent paper, and Google has reported that speculative execution can improve job running times by 44% [45]. Although in Hadoop both map and reduce tasks can be speculatively executed, the common wisdom is that the technique is more helpful for map tasks than reduce tasks, since each copy of the reduce task needs to pull data over the network. Note, however, that speculative execution cannot adequately address another common cause of stragglers: skew in the distribution of values associated with intermediate keys (leading to reduce stragglers). In text processing we often observe Zipfian distributions, which means that the task or tasks responsible for processing the most frequent few elements will run much longer than the typical task. Better local aggregation, discussed in the next chapter, is one possible solution to this problem.

Data/code co-location. The phrase data distribution is misleading, since one of the key ideas behind MapReduce is to move the code, not the data. However, the more general point remains—in order for computation to occur, we need to somehow feed data to the code. In MapReduce, this issue is inextricably intertwined with scheduling and relies heavily on the design of the underlying distributed file system.11 To achieve data locality, the scheduler starts tasks on the node that holds a particular block of data (i.e., on its local drive) needed by the task. This has the effect of moving code to the data. If this is not possible (e.g., a node is already running too many tasks), new tasks will be started elsewhere, and the necessary data will be streamed over the network. An important optimization here is to prefer nodes that are on the same rack in the datacenter as the node holding the relevant data block, since inter-rack bandwidth is significantly less than intra-rack bandwidth.

Synchronization. In general, synchronization refers to the mechanisms by which multiple concurrently running processes "join up", for example, to share intermediate results or otherwise exchange state information. In MapReduce, synchronization is accomplished by a barrier between the map and reduce phases of processing. Intermediate key-value pairs must be grouped by key, which is accomplished by a large distributed sort involving all the nodes that executed map tasks and all the nodes that will execute reduce tasks.

11 In the canonical case, that is. Recall that MapReduce may receive its input from other sources.

This necessarily involves copying intermediate data over the network, and therefore the process is commonly known as "shuffle and sort". A MapReduce job with m mappers and r reducers involves up to m × r distinct copy operations, since each mapper may have intermediate output going to every reducer.

Note that the reduce computation cannot start until all the mappers have finished emitting key-value pairs and all intermediate key-value pairs have been shuffled and sorted, since the execution framework cannot otherwise guarantee that all values associated with the same key have been gathered. This is an important departure from functional programming: in a fold operation, the aggregation function g is a function of the intermediate value and the next item in the list—which means that values can be lazily generated and aggregation can begin as soon as values are available. In contrast, the reducer in MapReduce receives all values associated with the same key at once. However, it is possible to start copying intermediate key-value pairs over the network to the nodes running the reducers as soon as each mapper finishes—this is a common optimization and implemented in Hadoop.

Error and fault handling. The MapReduce execution framework must accomplish all the tasks above in an environment where errors and faults are the norm, not the exception. Since MapReduce was explicitly designed around low-end commodity servers, the runtime must be especially resilient. In large clusters, disk failures are common [123] and RAM experiences more errors than one might expect [135]. Datacenters suffer from both planned outages (e.g., system maintenance and hardware upgrades) and unexpected outages (e.g., power failure, connectivity loss, etc.). And that's just hardware. No software is bug free—exceptions must be appropriately trapped, logged, and recovered from. Large-data problems have a penchant for uncovering obscure corner cases in code that is otherwise thought to be bug-free. Furthermore, any sufficiently large dataset will contain corrupted data or records that are mangled beyond a programmer's imagination—resulting in errors that one would never think to check for or trap. The MapReduce execution framework must thrive in this hostile environment.

2.4 PARTITIONERS AND COMBINERS

We have thus far presented a simplified view of MapReduce. There are two additional elements that complete the programming model: partitioners and combiners.

Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers. In other words, the partitioner specifies the task to which an intermediate key-value pair must be copied. Within each reducer, keys are processed in sorted order (which is how the "group by" is implemented).

The simplest partitioner involves computing the hash value of the key and then taking the mod of that value with the number of reducers. This assigns approximately the same number of keys to each reducer (dependent on the quality of the hash function). Note, however, that the partitioner only considers the key and ignores the value—therefore, a roughly-even partitioning of the key space may nevertheless yield large differences in the number of key-value pairs sent to each reducer (since different keys may have different numbers of associated values). This imbalance in the amount of data associated with each key is relatively common in many text processing applications due to the Zipfian distribution of word occurrences.

Combiners are an optimization in MapReduce that allow for local aggregation before the shuffle and sort phase. We can motivate the need for combiners by considering the word count algorithm in Figure 2.3, which emits a key-value pair for each word in the collection. Furthermore, all these key-value pairs need to be copied across the network, and so the amount of intermediate data will be larger than the input collection itself. This is clearly inefficient. One solution is to perform local aggregation on the output of each mapper, i.e., to compute a local count for a word over all the documents processed by the mapper. With this modification (assuming the maximum amount of local aggregation possible), the number of intermediate key-value pairs will be at most the number of unique words in the collection times the number of mappers (and typically far smaller because each mapper may not encounter every word).

The combiner in MapReduce supports such an optimization. One can think of combiners as "mini-reducers" that take place on the output of the mappers, prior to the shuffle and sort phase. Each combiner operates in isolation and therefore does not have access to intermediate output from other mappers. The combiner is provided keys and values associated with each key (the same types as the mapper output keys and values). Critically, one cannot assume that a combiner will have the opportunity to process all values associated with the same key. The combiner can emit any number of key-value pairs, but the keys and values must be of the same type as the mapper output (same as the reducer input).12 In cases where an operation is both associative and commutative (e.g., addition or multiplication), reducers can directly serve as combiners. In general, however, reducers and combiners are not interchangeable. In many cases, proper use of combiners can spell the difference between an impractical algorithm and an efficient algorithm. This topic will be discussed in Section 3.1, which focuses on various techniques for local aggregation. It suffices to say for now that a combiner can significantly reduce the amount of data that needs to be copied over the network, resulting in much faster algorithms.

12 A note on the implementation of combiners in Hadoop: by default, the execution framework reserves the right to use combiners at its discretion. In reality, this means that a combiner may be invoked zero, one, or multiple times. In addition, combiners in Hadoop may actually be invoked in the reduce phase, i.e., after key-value pairs have been copied over to the reducer, but before the user reducer code runs. As a result, combiners must be carefully written so that they can be executed in these different environments. Section 3.1.2 discusses this in more detail.

Figure 2.4: Complete view of MapReduce, illustrating combiners and partitioners in addition to mappers and reducers. Combiners can be viewed as "mini-reducers" in the map phase. Partitioners determine which reducer is responsible for a particular key.

The complete MapReduce model is shown in Figure 2.4. The output of the mappers is processed by the combiners, which perform local aggregation to cut down on the number of intermediate key-value pairs. The partitioner determines which reducer will be responsible for processing a particular key, and the execution framework uses this information to copy the data to the right location during the shuffle and sort phase.13 Therefore, a complete MapReduce job consists of code for the mapper, reducer, combiner, and partitioner, along with job configuration parameters. The execution framework handles everything else.

13 In Hadoop, partitioners are actually executed before combiners, so while Figure 2.4 is conceptually accurate, it doesn't precisely describe the Hadoop implementation.
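The hash-mod partitioner and a summing combiner described in this section can be sketched in a few lines of Python. This is an illustration of the idea rather than Hadoop code; in particular, a real partitioner must use a hash function that is deterministic across machines (Python's built-in hash of strings is randomized per process unless PYTHONHASHSEED is fixed).

```python
from collections import defaultdict

def partition(key, num_reducers):
    # Assign an intermediate key to a reducer: hash the key, then take the mod
    return hash(key) % num_reducers

def combine(mapper_output):
    # Local aggregation over a single mapper's output: sum counts per term.
    # Safe here because addition is associative and commutative.
    local = defaultdict(int)
    for term, count in mapper_output:
        local[term] += count
    return sorted(local.items())

pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]
combined = combine(pairs)                       # [('a', 3), ('b', 1), ('c', 1)]
assignments = {term: partition(term, 3) for term, _ in combined}
print(combined)
print(assignments)  # reducer indices in 0..2, depending on the hash values
```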

2.5 THE DISTRIBUTED FILE SYSTEM

So far, we have mostly focused on the processing aspect of data-intensive processing, but it is important to recognize that without data, there is nothing to compute on. In high-performance computing (HPC) and many traditional cluster architectures, storage is viewed as a distinct and separate component from computation. Implementations vary widely, but network-attached storage (NAS) and storage area networks (SAN) are common. Supercomputers often have dedicated subsystems for handling storage (separate nodes, and often even separate networks). Regardless of the details, the processing cycle remains the same at a high level: the compute nodes fetch input from storage, load the data into memory, process the data, and then write back the results (with perhaps intermediate checkpointing for long-running processes).

As dataset sizes increase, more compute capacity is required for processing. But as compute capacity grows, the link between the compute nodes and the storage becomes a bottleneck. At that point, one could invest in higher performance but more expensive networks (e.g., 10 gigabit Ethernet) or special-purpose interconnects such as InfiniBand (even more expensive). In most cases, this is not a cost-effective solution, as the price of networking equipment increases non-linearly with performance (e.g., a switch with ten times the capacity is usually more than ten times more expensive). Alternatively, one could abandon the separation of computation and storage as distinct components in a cluster. The distributed file system (DFS) that underlies MapReduce adopts exactly this approach. The Google File System (GFS) [57] supports Google's proprietary implementation of MapReduce; in the open-source world, HDFS (Hadoop Distributed File System) is an open-source implementation of GFS that supports Hadoop. Although MapReduce doesn't necessarily require the distributed file system, it is difficult to realize many of the advantages of the programming model without a storage substrate that behaves much like the DFS.14

Of course, distributed file systems are not new [74, 32, 7, 147, 133]. The MapReduce distributed file system builds on previous work but is specifically adapted to large-data processing workloads, and therefore departs from previous architectures in certain respects (see discussion by Ghemawat et al. [57] in the original GFS paper). The main idea is to divide user data into blocks and replicate those blocks across the local disks of nodes in the cluster. Blocking data, of course, is not a new idea, but DFS blocks are significantly larger than block sizes in typical single-machine file systems (64 MB by default). The distributed file system adopts a master–slave architecture in which the master maintains the file namespace (metadata, directory structure, file to block mapping, location of blocks, and access permissions) and the slaves manage the actual data blocks.

14 However, there is evidence that existing POSIX-based distributed cluster file systems (e.g., GPFS or PVFS) can serve as a replacement for HDFS, when properly tuned or modified for MapReduce workloads [146, 6]. This, however, remains an experimental use case.

the master is called the GFS master.5: The architecture of HDFS. or they may refer to daemons running on those machines providing the relevant services. block id) /foo/bar File namespace block 3df2 instructions to datanode datanode state (block id. which datanode).g.15 This book adopts the Hadoop terminology. Replicating blocks across physical machines also increases oppor15 To be precise. Linux). In response to the client request. so HDFS is resilient towards two common failure scenarios: individual datanode crashes and failures in networking equipment that bring an entire rack offline. The namenode (master) is responsible for maintaining the file namespace and directing clients to datanodes (slaves) that actually hold data blocks containing user data. Blocks are themselves stored on standard single-machine file systems. MAPREDUCE BASICS HDFS namenode Application HDFS Client (block id. Instead. an application client wishing to read a file (or a portion thereof) must first contact the namenode to determine where the actual data is stored. redrawn from a similar diagram describing GFS [57]. although for most basic file operations GFS and HDFS work much the same way.5. and performance.. The architecture of HDFS is shown in Figure 2. The client then contacts the datanode to retrieve the data. the same roles are filled by the namenode and datanodes. block location) (file name. In Hadoop. and the slaves are called GFS chunkservers. communications with the namenode only involves transfer of metadata.. the namenode returns the relevant block id and the location where the block is held (i. By default. namenode and datanode may refer to physical machines in a cluster. respectively. the three replicas are spread across different physical racks. In HDFS. An important feature of the design is that data is never moved through the namenode. byte range) block data HDFS datanode Linux file system HDFS datanode Linux file system … … Figure 2. all data transfer occurs directly between clients and datanodes.e. so HDFS lies on top of the standard OS stack (e. In GFS. HDFS stores three separate copies of each data block to ensure both reliability. data blocks. .32 CHAPTER 2. In large clusters. availability.

To create a new file and write data to HDFS, the application client first contacts the namenode, which updates the file namespace after checking permissions and making sure the file doesn't already exist. The namenode allocates a new block on a suitable datanode, and the application is directed to stream data directly to it. From the initial datanode, data is further propagated to additional replicas. In the most recent release of Hadoop as of this writing (release 0.20.2), files are immutable—they cannot be modified after creation. There are current plans to officially support file appends in the near future, which is a feature already present in GFS. The namenode is in periodic communication with the datanodes to ensure proper replication of all the blocks: if there aren't enough replicas (e.g., due to disk or machine failures or to connectivity losses due to networking equipment failures), the namenode directs the creation of additional copies;16 if there are too many replicas (e.g., a repaired node rejoins the cluster), extra copies are discarded.

In summary, the HDFS namenode has the following responsibilities:

• Namespace management. The namenode is responsible for maintaining the file namespace, which includes metadata, directory structure, file to block mapping, location of blocks, and access permissions. These data are held in memory for fast access and all mutations are persistently logged.

• Coordinating file operations. The namenode directs application clients to datanodes for read operations, and allocates blocks on suitable datanodes for write operations. All data transfers occur directly between clients and datanodes. When a file is deleted, HDFS does not immediately reclaim the available physical storage; rather, blocks are lazily garbage collected.

• Maintaining overall health of the file system. The namenode is in periodic contact with the datanodes via heartbeat messages to ensure the integrity of the system. If the namenode observes that a data block is under-replicated (fewer copies are stored on datanodes than the desired replication factor), it will direct the creation of new replicas. Finally, the namenode is also responsible for rebalancing the file system.17 During the course of normal operations, certain datanodes may end up holding more blocks than others; rebalancing involves moving blocks from datanodes with more blocks to datanodes with fewer blocks. This leads to better load balancing and more even disk utilization.

16 Note that the namenode coordinates the replication process, but data transfer occurs directly from datanode to datanode.
17 In Hadoop, this is a manually-invoked process.

Since GFS and HDFS were specifically designed to support Google's proprietary and the open-source implementation of MapReduce, respectively, they were designed with a number of assumptions about the operational environment, which in turn influenced the design of the systems. Understanding these choices is critical to designing effective MapReduce algorithms:

• The file system stores a relatively modest number of large files. The definition of "modest" varies by the size of the deployment, but in HDFS multi-gigabyte files are common (and even encouraged). There are several reasons why lots of small files are to be avoided. Since the namenode must hold all file metadata in memory, this presents an upper bound on both the number of files and blocks that can be supported.18 Large multi-block files represent a more efficient use of namenode memory than many single-block files (each of which consumes less space than a single block size). In addition, mappers in a MapReduce job use individual files as a basic unit for splitting input data. At present, there is no default mechanism in Hadoop that allows a mapper to process multiple files. As a result, mapping over many small files will yield as many map tasks as there are files. This results in two potential problems: first, the startup costs of mappers may become significant compared to the time spent actually processing input key-value pairs; second, this may result in an excessive amount of across-the-network copy operations during the "shuffle and sort" phase (recall that a MapReduce job with m mappers and r reducers involves up to m × r distinct copy operations).

• Workloads are batch oriented, dominated by long streaming reads and large sequential writes. As a result, high sustained bandwidth is more important than low latency. This exactly describes the nature of MapReduce jobs, which are batch operations on large amounts of data. Due to the common-case workload, both HDFS and GFS do not implement any form of data caching.19

• Applications are aware of the characteristics of the distributed file system. Neither HDFS nor GFS present a general POSIX-compliant API, but rather support only a subset of possible file operations. This simplifies the design of the distributed file system, and in essence pushes part of the data management onto the end application. One rationale for this decision is that each application knows best how to handle data specific to that application, for example, in terms of resolving inconsistent states and optimizing the layout of data structures.

18 According to Dhruba Borthakur in a post to the Hadoop mailing list on 6/8/2008, each block in HDFS occupies about 150 bytes of memory on the namenode.
19 However, since the distributed file system is built on top of a standard operating system such as Linux, there is still OS-level caching.

• The file system is deployed in an environment of cooperative users. There is no discussion of security in the original GFS paper, but HDFS explicitly assumes a datacenter environment where only authorized users have access. File permissions in HDFS are only meant to prevent unintended operations and can be easily circumvented.20

• The system is built from unreliable but inexpensive commodity components. As a result, failures are the norm rather than the exception. HDFS is designed around a number of self-monitoring and self-healing mechanisms to robustly cope with common failure modes.

Finally, some discussion is necessary to understand the single-master design of HDFS and GFS. It has been demonstrated that in large-scale distributed systems, simultaneously providing consistency, availability, and partition tolerance is impossible—this is Brewer's so-called CAP Theorem [58]. Since partitioning is unavoidable in large-data systems, the real tradeoff is between consistency and availability. A single-master design trades availability for consistency and significantly simplifies implementation. If the master (HDFS namenode or GFS master) goes down, the entire file system becomes unavailable, which trivially guarantees that the file system will never be in an inconsistent state. An alternative design might involve multiple masters that jointly manage the file namespace—such an architecture would increase availability (if one goes down, another can step in) at the cost of consistency, not to mention requiring a more complex implementation (cf. [4, 105]).

The single-master design of GFS and HDFS is a well-known weakness, since if the master goes offline, the entire file system and all MapReduce jobs running on top of it will grind to a halt. This weakness is mitigated in part by the lightweight nature of file system operations. Recall that no data is ever moved through the namenode and that all communication between clients and datanodes involves only metadata. Because of this, the namenode rarely is the bottleneck, and for the most part avoids load-induced crashes. In practice, this single point of failure is not as severe a limitation as it may appear—with diligent monitoring of the namenode, mean times between failure measured in months are not uncommon for production deployments. Furthermore, the Hadoop community is well-aware of this problem and has developed several reasonable workarounds—for example, a warm standby namenode that can be quickly switched over when the primary namenode fails. The open source environment and the fact that many organizations already depend on Hadoop for production systems virtually guarantees that more effective solutions will be developed over time.

20 However, there are existing plans to integrate Kerberos into Hadoop/HDFS.

2.6 HADOOP CLUSTER ARCHITECTURE

Putting everything together, the architecture of a complete Hadoop cluster is shown in Figure 2.6. The HDFS namenode runs the namenode daemon. The job submission node runs the jobtracker, which is the single point of contact for a client wishing to execute a MapReduce job. The jobtracker monitors the progress of running MapReduce jobs and is responsible for coordinating the execution of the mappers and reducers. Typically, these services run on two separate machines, although in smaller clusters they are often co-located. The bulk of a Hadoop cluster consists of slave nodes (only three of which are shown in the figure) that run both a tasktracker, which is responsible for actually running user code, and a datanode daemon, for serving HDFS data.

Figure 2.6: Architecture of a complete Hadoop cluster, which consists of three separate components: the HDFS master (called the namenode), the job submission node (called the jobtracker), and many slave nodes (three shown here). Each of the slave nodes runs a tasktracker for executing map and reduce tasks and a datanode daemon for serving HDFS data.

A Hadoop MapReduce job is divided up into a number of map tasks and reduce tasks. Tasktrackers periodically send heartbeat messages to the jobtracker that also doubles as a vehicle for task allocation. If a tasktracker is available to run tasks (in Hadoop parlance, has empty task slots), the return acknowledgment of the tasktracker heartbeat contains task allocation information. The number of reduce tasks is equal to the number of reducers specified by the programmer. The number of map tasks, on the other hand, depends on many factors: the number of mappers specified by the programmer serves as a hint to the execution framework, but the actual number of tasks depends on both the number of input files and the number of HDFS data blocks occupied by those files. Each map task is assigned a sequence of input key-value pairs, called an input split in Hadoop.

Input splits are computed automatically and the execution framework strives to align them to HDFS block boundaries so that each map task is associated with a single data block. The alignment of input splits with HDFS block boundaries simplifies task scheduling. In scheduling map tasks, the jobtracker tries to take advantage of data locality—if possible, map tasks are scheduled on the slave node that holds the input split, so that the mapper will be processing local data. If it is not possible to run a map task on local data, it becomes necessary to stream input key-value pairs across the network. Since large clusters are organized into racks, with far greater intra-rack bandwidth than inter-rack bandwidth, the execution framework strives to at least place map tasks on a rack which has a copy of the data block.

Although conceptually in MapReduce one can think of the mapper being applied to all input key-value pairs and the reducer being applied to all values associated with the same key, actual job execution is a bit more complex. In Hadoop, mappers are Java objects with a Map method (among others). A mapper object is instantiated for every map task by the tasktracker. The life-cycle of this object begins with instantiation, where a hook is provided in the API to run programmer-specified code. This means that mappers can read in "side data", providing an opportunity to load state, static data sources, dictionaries, etc. After initialization, the Map method is called (by the execution framework) on all key-value pairs in the input split. Since these method calls occur in the context of the same Java object, it is possible to preserve state across multiple input key-value pairs within the same map task—this is an important property to exploit in the design of MapReduce algorithms, as we will see in the next chapter. After all key-value pairs in the input split have been processed, the mapper object provides an opportunity to run programmer-specified termination code. This, too, will be important in the design of MapReduce algorithms.

The actual execution of reducers is similar to that of the mappers. Each reducer object is instantiated for every reduce task. The Hadoop API provides hooks for programmer-specified initialization and termination code. After initialization, for each intermediate key in the partition (defined by the partitioner), the execution framework repeatedly calls the Reduce method with an intermediate key and an iterator over all values associated with that key. The programming model also guarantees that intermediate keys will be presented to the Reduce method in sorted order. Since this occurs in the context of a single object, it is possible to preserve state across multiple intermediate keys (and associated values) within a single reduce task. Once again, this property is critical in the design of MapReduce algorithms and will be discussed in the next chapter.
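As a concrete illustration of the lifecycle hooks just described, the following is a minimal sketch (not from the book) using Hadoop's org.apache.hadoop.mapreduce API: setup corresponds to the initialization hook, map is called once per input key-value pair, and cleanup corresponds to the termination hook. The class name and the trivial tokenization are illustrative assumptions only.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Initialization hook: runs once per map task, before any calls to map().
    // This is where "side data" (dictionaries, static data sources, etc.) can be loaded.
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Called once for every input key-value pair in the input split.
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Termination hook: runs once after all pairs in the input split are processed.
  }
}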

2.7 SUMMARY

This chapter provides a basic overview of the MapReduce programming model, starting with its roots in functional programming and continuing with a description of mappers, reducers, partitioners, and combiners. Significant attention is also given to the underlying distributed file system, which is a tightly-integrated component of the MapReduce environment. Given this basic understanding, we now turn our attention to the design of MapReduce algorithms.

CHAPTER 3

MapReduce Algorithm Design

A large part of the power of MapReduce comes from its simplicity: in addition to preparing the input data, the programmer needs only to implement the mapper, the reducer, and optionally, the combiner and the partitioner. All other aspects of execution are handled transparently by the execution framework—on clusters ranging from a single node to a few thousand nodes, over datasets ranging from gigabytes to petabytes. However, this also means that any conceivable algorithm that a programmer wishes to develop must be expressed in terms of a small number of rigidly-defined components that must fit together in very specific ways. It may not appear obvious how a multitude of algorithms can be recast into this programming model. The purpose of this chapter is to provide, primarily through examples, a guide to MapReduce algorithm design. These examples illustrate what can be thought of as "design patterns" for MapReduce, which instantiate arrangements of components and specific techniques designed to handle frequently-encountered situations across a variety of problem domains. Two of these design patterns are used in the scalable inverted indexing algorithm we'll present later in Chapter 4; concepts presented here will show up again in Chapter 5 (graph processing) and Chapter 6 (expectation-maximization algorithms).

Synchronization is perhaps the most tricky aspect of designing MapReduce algorithms (or for that matter, parallel and distributed algorithms in general). Other than embarrassingly-parallel problems, processes running on separate nodes in a cluster must, at some point in time, come together—for example, to distribute partial results from nodes that produced them to the nodes that will consume them. Within a single MapReduce job, there is only one opportunity for cluster-wide synchronization—during the shuffle and sort stage where intermediate key-value pairs are copied from the mappers to the reducers and grouped by key. Beyond that, mappers and reducers run in isolation without any mechanisms for direct communication. Furthermore, the programmer has little control over many aspects of execution, for example:

• Where a mapper or reducer runs (i.e., on which node in the cluster).
• When a mapper or reducer begins or finishes.
• Which input key-value pairs are processed by a specific mapper.
• Which intermediate key-value pairs are processed by a specific reducer.

Nevertheless, the programmer does have a number of techniques for controlling execution and managing the flow of data in MapReduce. In summary, they are:

1. The ability to construct complex data structures as keys and values to store and communicate partial results.

2. The ability to execute user-specified initialization code at the beginning of a map or reduce task, and the ability to execute user-specified termination code at the end of a map or reduce task.

3. The ability to preserve state in both mappers and reducers across multiple input or intermediate keys.

4. The ability to control the sort order of intermediate keys, and therefore the order in which a reducer will encounter particular keys.

5. The ability to control the partitioning of the key space, and therefore the set of keys that will be encountered by a particular reducer.

It is important to realize that many algorithms cannot be easily expressed as a single MapReduce job. One must often decompose complex algorithms into a sequence of jobs, which requires orchestrating data so that the output of one job becomes the input to the next. Many algorithms are iterative in nature, requiring repeated execution until some convergence criteria—graph algorithms in Chapter 5 and expectation-maximization algorithms in Chapter 6 behave in exactly this way. Often, the convergence check itself cannot be easily expressed in MapReduce. The standard solution is an external (non-MapReduce) program that serves as a "driver" to coordinate MapReduce iterations.

This chapter explains how various techniques to control code execution and data flow can be applied to design algorithms in MapReduce. The focus is both on scalability—ensuring that there are no inherent bottlenecks as algorithms are applied to increasingly larger datasets—and efficiency—ensuring that algorithms do not needlessly consume resources and thereby reducing the cost of parallelization. The gold standard, of course, is linear scalability: an algorithm running on twice the amount of data should take only twice as long. Similarly, an algorithm running on twice the number of nodes should only take half as long.

The chapter is organized as follows:

• Section 3.1 introduces the important concept of local aggregation in MapReduce and strategies for designing efficient algorithms that minimize the amount of partial results that need to be copied across the network. The proper use of combiners is discussed in detail, as well as the "in-mapper combining" design pattern.

• Section 3.2 uses the example of building word co-occurrence matrices on large text corpora to illustrate two common design patterns, which we dub "pairs" and "stripes". These two approaches are useful in a large class of problems that require keeping track of joint events across a large number of observations.

• Section 3.3 shows how co-occurrence counts can be converted into relative frequencies using a pattern known as "order inversion". Often, a reducer needs to compute an aggregate statistic on a set of elements before individual elements can be processed. Normally, this would require two passes over the data, but with the "order inversion" design pattern the aggregate statistic can be computed in the reducer before the individual elements are encountered. This may seem counter-intuitive: how can we compute an aggregate statistic on a set of elements before encountering elements of that set? As it turns out, clever sorting of special key-value pairs enables exactly this. The sequencing of computations in the reducer can be recast as a sorting problem, where pieces of intermediate data are sorted into exactly the order that is required to carry out a series of computations.

• Section 3.4 provides a general solution to secondary sorting, which is the problem of sorting values associated with a key in the reduce phase. We call this technique "value-to-key conversion".

• Section 3.5 covers the topic of performing joins on relational datasets and presents three different approaches: reduce-side, map-side, and memory-backed joins.

3.1 LOCAL AGGREGATION

In the context of data-intensive distributed processing, the single most important aspect of synchronization is the exchange of intermediate results, from the processes that produced them to the processes that will ultimately consume them. In a cluster environment, with the exception of embarrassingly-parallel problems, this necessarily involves transferring data over the network. Furthermore, in Hadoop, intermediate results are written to local disk before being sent over the network. Since network and disk latencies are relatively expensive compared to other operations, reductions in the amount of intermediate data translate into increases in algorithmic efficiency. In MapReduce, local aggregation of intermediate results is one of the keys to efficient algorithms. Through use of the combiner and by taking advantage of the ability to preserve state across multiple inputs, it is often possible to substantially reduce both the number and size of key-value pairs that need to be shuffled from the mappers to the reducers.

3.1.1 COMBINERS AND IN-MAPPER COMBINING

We illustrate various techniques for local aggregation using the simple word count example presented in Section 2.2. For convenience, Figure 3.1 repeats the pseudo-code of the basic algorithm, which is quite simple: the mapper emits an intermediate key-value pair for each term observed, with the term itself as the key and a value of one; reducers sum up the partial counts to arrive at the final count.

class Mapper
  method Map(docid a, doc d)
    for all term t ∈ doc d do
      Emit(term t, count 1)

class Reducer
  method Reduce(term t, counts [c1, c2, . . .])
    sum ← 0
    for all count c ∈ counts [c1, c2, . . .] do
      sum ← sum + c
    Emit(term t, count sum)

Figure 3.1: Pseudo-code for the basic word count algorithm in MapReduce (repeated from Figure 2.3).

The first technique for local aggregation is the combiner, already discussed in Section 2.4. Combiners provide a general mechanism within the MapReduce framework to reduce the amount of intermediate data generated by the mappers—recall that they can be understood as "mini-reducers" that process the output of mappers. In this example, the combiners aggregate term counts across the documents processed by each map task. This results in a reduction in the number of intermediate key-value pairs that need to be shuffled across the network—from the order of total number of terms in the collection to the order of the number of unique terms in the collection.1

An improvement on the basic algorithm is shown in Figure 3.2 (the mapper is modified but the reducer remains the same as in Figure 3.1 and therefore is not repeated). An associative array (i.e., Map in Java) is introduced inside the mapper to tally up term counts within a single document: instead of emitting a key-value pair for each term in the document, this version emits a key-value pair for each unique term in the document. Given that some words appear frequently within a document (for example, a document about dogs is likely to have many occurrences of the word "dog"), this can yield substantial savings in the number of intermediate key-value pairs emitted, especially for long documents.

1 More precisely, if the combiners take advantage of all opportunities for local aggregation, the algorithm would generate at most m × V intermediate key-value pairs, where m is the number of mappers and V is the vocabulary size (number of unique terms in the collection), since every term could have been observed in every mapper. However, there are two additional factors to consider. Due to the Zipfian nature of term distributions, most terms will not be observed by most mappers (for example, terms that occur only once will by definition only be observed by one mapper). On the other hand, combiners in Hadoop are treated as optional optimizations, so there is no guarantee that the execution framework will take advantage of all opportunities for partial aggregation.
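In Hadoop, a combiner is attached to a job in the driver program. The following is a minimal sketch (not from the book) of what such a driver might look like for the word count example; the class names WordCountMapper and WordCountReducer are hypothetical, and the reducer is reused as the combiner only because summing counts is commutative and associative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);      // emits (term, 1)
    job.setCombinerClass(WordCountReducer.class);   // reducer doubles as combiner;
                                                    // the framework may run it zero or more times
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}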

class Mapper
  method Map(docid a, doc d)
    H ← new AssociativeArray
    for all term t ∈ doc d do
      H{t} ← H{t} + 1                          ▷ Tally counts for entire document
    for all term t ∈ H do
      Emit(term t, count H{t})

Figure 3.2: Pseudo-code for the improved MapReduce word count algorithm that uses an associative array to aggregate term counts on a per-document basis. Reducer is the same as in Figure 3.1.

This basic idea can be taken one step further, as illustrated in the variant of the word count algorithm in Figure 3.3 (once again, only the mapper is modified). The workings of this algorithm critically depend on the details of how map and reduce tasks in Hadoop are executed, discussed in Section 2.6. Recall, a (Java) mapper object is created for each map task, which is responsible for processing a block of input key-value pairs. Prior to processing any input key-value pairs, the mapper's Initialize method is called, which is an API hook for user-specified code. In this case, we initialize an associative array for holding term counts. Since it is possible to preserve state across multiple calls of the Map method (for each input key-value pair), we can continue to accumulate partial term counts in the associative array across multiple documents, and emit key-value pairs only when the mapper has processed all documents. That is, emission of intermediate data is deferred until the Close method in the pseudo-code. Recall that this API hook provides an opportunity to execute user-specified code after the Map method has been applied to all input key-value pairs of the input data split to which the map task was assigned.

With this technique, we are in essence incorporating combiner functionality directly inside the mapper. There is no need to run a separate combiner, since all opportunities for local aggregation are already exploited.2 This is a sufficiently common design pattern in MapReduce that it's worth giving it a name, "in-mapper combining", so that we can refer to the pattern more conveniently throughout the book. We'll see later on how this pattern can be applied to a variety of problems. There are two main advantages to using this design pattern:

First, it provides control over when local aggregation occurs and how it exactly takes place. In contrast, the semantics of the combiner is underspecified in MapReduce.

2 Leaving aside the minor complication that in Hadoop, combiners can be run in the reduce phase also (when merging intermediate key-value pairs from different map tasks). In practice, it makes almost no difference either way.

class Mapper
  method Initialize
    H ← new AssociativeArray
  method Map(docid a, doc d)
    for all term t ∈ doc d do
      H{t} ← H{t} + 1                          ▷ Tally counts across documents
  method Close
    for all term t ∈ H do
      Emit(term t, count H{t})

Figure 3.3: Pseudo-code for the improved MapReduce word count algorithm that demonstrates the "in-mapper combining" design pattern. Reducer is the same as in Figure 3.1.

For example, Hadoop makes no guarantees on how many times the combiner is applied, or that it is even applied at all. The combiner is provided as a semantics-preserving optimization to the execution framework, which has the option of using it, perhaps multiple times, or not at all (or even in the reduce phase). In some cases (although not in this particular example), such indeterminism is unacceptable, which is exactly why programmers often choose to perform their own local aggregation in the mappers.

Second, in-mapper combining will typically be more efficient than using actual combiners. One reason for this is the additional overhead associated with actually materializing the key-value pairs. Combiners reduce the amount of intermediate data that is shuffled across the network, but don't actually reduce the number of key-value pairs that are emitted by the mappers in the first place. With the algorithm in Figure 3.2, intermediate key-value pairs are still generated on a per-document basis, only to be "compacted" by the combiners. This process involves unnecessary object creation and destruction (garbage collection takes time), and furthermore, object serialization and deserialization (when intermediate key-value pairs fill the in-memory buffer holding map outputs and need to be temporarily spilled to disk). In contrast, with in-mapper combining, the mappers will generate only those key-value pairs that need to be shuffled across the network to the reducers.

There are, however, drawbacks to the in-mapper combining pattern. First, it breaks the functional programming underpinnings of MapReduce, since state is being preserved across multiple input key-value pairs. Ultimately, this isn't a big deal, since pragmatic concerns for efficiency often trump theoretical "purity", but there are practical consequences as well. Preserving state across multiple input instances means that algorithmic behavior may depend on the order in which input key-value pairs are encountered. This creates the potential for ordering-dependent bugs, which are difficult to debug on large datasets in the general case (although the correctness of in-mapper combining for word count is easy to demonstrate).

Second, there is a fundamental scalability bottleneck associated with the in-mapper combining pattern. It critically depends on having sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in an input split. In the word count example, the memory footprint is bound by the vocabulary size, since it is theoretically possible that a mapper encounters every term in the collection. Heaps' Law, a well-known result in information retrieval, accurately models the growth of vocabulary size as a function of the collection size—the somewhat surprising fact is that the vocabulary size never stops growing.3 Therefore, the algorithm in Figure 3.3 will scale only up to a point, beyond which the associative array holding the partial term counts will no longer fit in memory.4

One common solution to limiting memory usage when using the in-mapper combining technique is to "block" input key-value pairs and "flush" in-memory data structures periodically. The idea is simple: instead of emitting intermediate data only after every key-value pair has been processed, emit partial results after processing every n key-value pairs. This is straightforwardly implemented with a counter variable that keeps track of the number of input key-value pairs that have been processed. As an alternative, the mapper could keep track of its own memory footprint and flush intermediate key-value pairs once memory usage has crossed a certain threshold. In both approaches, either the block size or the memory usage threshold needs to be determined empirically: with too large a value, the mapper may run out of memory, but with too small a value, opportunities for local aggregation may be lost. Furthermore, in Hadoop physical memory is split between multiple tasks that may be running on a node concurrently; these tasks are all competing for finite resources, but since the tasks are not aware of each other, it is difficult to coordinate resource consumption effectively. In practice, one often encounters diminishing returns in performance gains with increasing buffer sizes, such that it is not worth the effort to search for an optimal buffer size (personal communication, Jeff Dean).

In MapReduce algorithms, the extent to which efficiency can be increased through local aggregation depends on the size of the intermediate key space, the distribution of keys themselves, and the number of key-value pairs that are emitted by each individual map task. Opportunities for aggregation, after all, come from having multiple values associated with the same key (whether one uses combiners or employs the in-mapper combining pattern). In the word count example, local aggregation is effective because many words are encountered multiple times within a map task.

3 In more detail, Heaps' Law relates the vocabulary size V to the collection size as follows: V = kT^b, where T is the number of tokens in the collection. Typical values of the parameters k and b are: 30 ≤ k ≤ 100 and b ∼ 0.5 ([101], p. 81).
4 A few more details: note what matters is that the partial term counts encountered within a particular input split fit into memory. However, as collection sizes increase, one will often want to increase the input split size to limit the growth of the number of map tasks (in order to reduce the number of distinct copy operations necessary to shuffle intermediate data over the network).
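To make the pattern concrete, here is a minimal Hadoop sketch (not from the book) of an in-mapper combining word count mapper in the style of Figure 3.3, combined with the memory-bounding "flush" strategy just discussed. The flush threshold is a hypothetical tuning parameter that would need to be determined empirically, as noted above.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int FLUSH_THRESHOLD = 1_000_000;  // hypothetical bound on distinct terms
  private Map<String, Integer> counts;                   // the associative array H of Figure 3.3

  @Override
  protected void setup(Context context) {
    counts = new HashMap<>();
  }

  @Override
  protected void map(LongWritable offset, Text doc, Context context)
      throws IOException, InterruptedException {
    for (String term : doc.toString().split("\\s+")) {
      if (!term.isEmpty()) {
        counts.merge(term, 1, Integer::sum);             // tally counts across documents
      }
    }
    if (counts.size() >= FLUSH_THRESHOLD) {
      flush(context);                                    // bound the memory footprint
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    flush(context);                                      // deferred emission at end of the input split
  }

  private void flush(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
    counts.clear();
  }
}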

Local aggregation is also an effective technique for dealing with reduce stragglers (see Section 2.3) that result from a highly-skewed (e.g., Zipfian) distribution of values associated with intermediate keys. In our word count example, we do not filter frequently-occurring words: therefore, without local aggregation, the reducer that's responsible for computing the count of 'the' will have a lot more work to do than the typical reducer, and therefore will likely be a straggler. With local aggregation (either combiners or in-mapper combining), we substantially reduce the number of values associated with frequently-occurring terms, which alleviates the reduce straggler problem.

3.1.2 ALGORITHMIC CORRECTNESS WITH LOCAL AGGREGATION

Although use of combiners can yield dramatic reductions in algorithm running time, care must be taken in applying them. Since combiners in Hadoop are viewed as optional optimizations, the correctness of the algorithm cannot depend on computations performed by the combiner or depend on them even being run at all. In any MapReduce program, the reducer input key-value type must match the mapper output key-value type: this implies that the combiner input and output key-value types must match the mapper output key-value type (which is the same as the reducer input key-value type). In cases where the reduce computation is both commutative and associative, the reducer can also be used (unmodified) as the combiner (as is the case with the word count example). In the general case, however, combiners and reducers are not interchangeable.

Consider a simple example: we have a large dataset where input keys are strings and input values are integers, and we wish to compute the mean of all integers associated with the same key (rounded to the nearest integer). A real-world example might be a large user log from a popular website, where keys represent user ids and values represent some measure of activity such as elapsed time for a particular session—the task would correspond to computing the mean session length on a per-user basis, which would be useful for understanding user demographics. Figure 3.4 shows the pseudo-code of a simple algorithm for accomplishing this task that does not involve combiners. We use an identity mapper, which simply passes all input key-value pairs to the reducers (appropriately grouped and sorted). The reducer keeps track of the running sum and the number of integers encountered. This information is used to compute the mean once all values are processed. The mean is then emitted as the output value in the reducer (with the input string as the key).

This algorithm will indeed work, but suffers from the same drawbacks as the basic word count algorithm in Figure 3.1: it requires shuffling all key-value pairs from mappers to reducers across the network, which is highly inefficient. Unlike in the word count example, the reducer cannot be used as a combiner in this case. Consider what would happen if we did: the combiner would compute the mean of an arbitrary subset of values associated with the same key, and the reducer would compute the mean of those values.

class Mapper
  method Map(string t, integer r)
    Emit(string t, integer r)

class Reducer
  method Reduce(string t, integers [r1, r2, . . .])
    sum ← 0
    cnt ← 0
    for all integer r ∈ integers [r1, r2, . . .] do
      sum ← sum + r
      cnt ← cnt + 1
    ravg ← sum/cnt
    Emit(string t, integer ravg)

Figure 3.4: Pseudo-code for the basic MapReduce algorithm that computes the mean of values associated with the same key.

As a concrete example, we know that:

Mean(1, 2, 3, 4, 5) ≠ Mean(Mean(1, 2), Mean(3, 4, 5))

In general, the mean of means of arbitrary subsets of a set of numbers is not the same as the mean of the set of numbers. Therefore, this approach would not produce the correct result.5

So how might we properly take advantage of combiners? An attempt is shown in Figure 3.5. The mapper remains the same, but we have added a combiner that partially aggregates results by computing the numeric components necessary to arrive at the mean. The combiner receives each string and the associated list of integer values, from which it computes the sum of those values and the number of integers encountered (i.e., the count). The sum and count are packaged into a pair, and emitted as the output of the combiner, with the same string as the key. In the reducer, pairs of partial sums and counts can be aggregated to arrive at the mean. Up until now, all keys and values in our algorithms have been primitives (string, integers, etc.). However, there are no prohibitions in MapReduce for more complex types,6 and, in fact, this represents a key technique in MapReduce algorithm design that we introduced at the beginning of this chapter. We will frequently encounter complex keys and values throughout the rest of this book.

5 There is, in fact, one special case in which using reducers as combiners would produce the correct result: if each combiner computed the mean of equal-size subsets of the values. However, such a scenario is highly unlikely, since such fine-grained control over the combiners is impossible in MapReduce.
6 In Hadoop, either custom types or types defined using a library such as Protocol Buffers, Thrift, or Avro.

class Mapper
  method Map(string t, integer r)
    Emit(string t, integer r)

class Combiner
  method Combine(string t, integers [r1, r2, . . .])
    sum ← 0
    cnt ← 0
    for all integer r ∈ integers [r1, r2, . . .] do
      sum ← sum + r
      cnt ← cnt + 1
    Emit(string t, pair (sum, cnt))              ▷ Separate sum and count

class Reducer
  method Reduce(string t, pairs [(s1, c1), (s2, c2) . . .])
    sum ← 0
    cnt ← 0
    for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2) . . .] do
      sum ← sum + s
      cnt ← cnt + c
    ravg ← sum/cnt
    Emit(string t, integer ravg)

Figure 3.5: Pseudo-code for an incorrect first attempt at introducing combiners to compute the mean of values associated with each key. The mismatch between combiner input and output key-value types violates the MapReduce programming model.

Unfortunately, this algorithm will not work. Recall that combiners must have the same input and output key-value type, which also must be the same as the mapper output type and the reducer input type. This is clearly not the case. To understand why this restriction is necessary in the programming model, remember that combiners are optimizations that cannot change the correctness of the algorithm. So let us remove the combiner and see what happens: the output value type of the mapper is integer, so the reducer expects to receive a list of integers as values. But the reducer actually expects a list of pairs! The correctness of the algorithm is contingent on the combiner running on the output of the mappers, and more specifically, that the combiner is run exactly once. Recall from our previous discussion that Hadoop makes no guarantees on how many times combiners are called; it could be zero, one, or multiple times. This violates the MapReduce programming model.
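The arithmetic behind this failure is easy to check outside of MapReduce. The small standalone Java program below (not from the book) contrasts averaging per-subset means, which gives 2.75, with propagating (sum, count) pairs, which combine associatively and give the true mean of 3.0; the latter is exactly the idea behind the corrected algorithm that follows.

import java.util.Arrays;
import java.util.List;

public class MeanCombinerDemo {
  public static void main(String[] args) {
    List<int[]> subsets = Arrays.asList(new int[]{1, 2}, new int[]{3, 4, 5});

    // Incorrect: treat each subset mean as if it were a final value, then average.
    double meanOfMeans = subsets.stream()
        .mapToDouble(s -> Arrays.stream(s).average().getAsDouble())
        .average().getAsDouble();                 // (1.5 + 4.0) / 2 = 2.75

    // Correct: propagate (sum, count) pairs, which combine associatively.
    long sum = 0, cnt = 0;
    for (int[] s : subsets) {
      sum += Arrays.stream(s).sum();
      cnt += s.length;
    }
    double mean = (double) sum / cnt;             // 15 / 5 = 3.0

    System.out.println("mean of means = " + meanOfMeans + ", true mean = " + mean);
  }
}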

Another stab at the algorithm is shown in Figure 3.6, and this time, the algorithm is correct. In the mapper we emit as the value a pair consisting of the integer and one—this corresponds to a partial count over one instance. The combiner separately aggregates the partial sums and the partial counts (as before), and emits pairs with updated sums and counts. The reducer is similar to the combiner, except that the mean is computed at the end. In essence, this algorithm transforms a non-associative operation (mean of numbers) into an associative operation (element-wise sum of a pair of numbers, with an additional division at the very end).

class Mapper
  method Map(string t, integer r)
    Emit(string t, pair (r, 1))

class Combiner
  method Combine(string t, pairs [(s1, c1), (s2, c2) . . .])
    sum ← 0
    cnt ← 0
    for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2) . . .] do
      sum ← sum + s
      cnt ← cnt + c
    Emit(string t, pair (sum, cnt))

class Reducer
  method Reduce(string t, pairs [(s1, c1), (s2, c2) . . .])
    sum ← 0
    cnt ← 0
    for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2) . . .] do
      sum ← sum + s
      cnt ← cnt + c
    ravg ← sum/cnt
    Emit(string t, integer ravg)

Figure 3.6: Pseudo-code for a MapReduce algorithm that computes the mean of values associated with each key. This algorithm correctly takes advantage of combiners.

Let us verify the correctness of this algorithm by repeating the previous exercise: What would happen if no combiners were run? With no combiners, the mappers would send pairs (as values) directly to the reducers. There would be as many intermediate pairs as there were input key-value pairs, and each of those would consist of an integer and one.

The reducer would still arrive at the correct sum and count, and hence the mean would be correct. Now add in the combiners: the algorithm would remain correct, no matter how many times they run, since the combiners merely aggregate partial sums and counts to pass along to the reducers. Note that although the output key-value type of the combiner must be the same as the input key-value type of the reducer, the reducer can emit final key-value pairs of a different type.

Finally, in Figure 3.7, we present an even more efficient algorithm that exploits the in-mapper combining pattern. Inside the mapper, the partial sums and counts associated with each string are held in memory across input key-value pairs. Intermediate key-value pairs are emitted only after the entire input split has been processed; similar to before, the value is a pair consisting of the sum and count. The reducer is exactly the same as in Figure 3.6. Moving partial aggregation from the combiner directly into the mapper is subjected to all the tradeoffs and caveats discussed earlier this section, but in this case the memory footprint of the data structures for holding intermediate data is likely to be modest, making this variant algorithm an attractive option.

class Mapper
  method Initialize
    S ← new AssociativeArray
    C ← new AssociativeArray
  method Map(string t, integer r)
    S{t} ← S{t} + r
    C{t} ← C{t} + 1
  method Close
    for all term t ∈ S do
      Emit(term t, pair (S{t}, C{t}))

Figure 3.7: Pseudo-code for a MapReduce algorithm that computes the mean of values associated with each key, illustrating the in-mapper combining design pattern. Only the mapper is shown here; the reducer is the same as in Figure 3.6.

3.2 PAIRS AND STRIPES

One common approach for synchronization in MapReduce is to construct complex keys and values in such a way that data necessary for a computation are naturally brought together by the execution framework. We first touched on this technique in the previous section, in the context of "packaging" partial sums and counts in a complex value (i.e., pair) that is passed from mapper to combiner to reducer. Building on previously published work [54, 94], this section introduces two common design patterns we have dubbed "pairs" and "stripes" that exemplify this strategy.

As a running example, we focus on the problem of building word co-occurrence matrices from large corpora, a common task in corpus linguistics and statistical natural language processing. Formally, the co-occurrence matrix of a corpus is a square n × n matrix where n is the number of unique words in the corpus (i.e., the vocabulary size). A cell mij contains the number of times word wi co-occurs with word wj within a specific context—a natural unit such as a sentence, paragraph, or a document, or a certain window of m words (where m is an application-dependent parameter). Note that the upper and lower triangles of the matrix are identical since co-occurrence is a symmetric relation, though in the general case relations between words need not be symmetric. For example, a co-occurrence matrix M where mij is the count of how many times word i was immediately succeeded by word j would usually not be symmetric.

This task is quite common in text processing and provides the starting point to many other algorithms, e.g., for computing statistics such as pointwise mutual information [38], for unsupervised sense clustering [136], and more generally, a large body of work in lexical semantics based on distributional profiles of words, dating back to Firth [55] and Harris [69] in the 1950s and 1960s. The task also has applications in information retrieval (e.g., automatic thesaurus construction [137] and stemming [157]), and other related fields such as text mining. More importantly, this problem represents a specific instance of the task of estimating distributions of discrete joint events from a large number of observations, a very common task in statistical natural language processing for which there are nice MapReduce solutions. Indeed, concepts presented here are also used in Chapter 6 when we discuss expectation-maximization algorithms.

Beyond text processing, problems in many application domains share similar characteristics. For example, a large retailer might analyze point-of-sale transaction records to identify correlated product purchases (e.g., customers who buy this tend to also buy that), which would assist in inventory management and product placement on store shelves. Similarly, an intelligence analyst might wish to identify associations between re-occurring financial transactions that are otherwise unrelated, which might provide a clue in thwarting terrorist activity. The algorithms discussed in this section could be adapted to tackle these related problems.

It is obvious that the space requirement for the word co-occurrence problem is O(n^2), where n is the size of the vocabulary, which for real-world English corpora can be hundreds of thousands of words, or even billions of words in web-scale collections.7 The computation of the word co-occurrence matrix is quite simple if the entire matrix fits into memory—however, in the case where the matrix is too big to fit in memory, a naïve implementation on a single machine can be very slow as memory is paged to disk.

7 The size of the vocabulary depends on the definition of a "word" and techniques (if any) for corpus preprocessing. One common strategy is to replace all rare words (below a certain frequency) with a "special" token such as <UNK> (which stands for "unknown") to model out-of-vocabulary words. Another technique involves replacing numeric digits with #, such that 1.32 and 1.19 both map to the same token (#.##).

Although compression techniques can increase the size of corpora for which word co-occurrence matrices can be constructed on a single machine, it is clear that there are inherent scalability limitations. We describe two MapReduce algorithms for this task that can scale to large corpora.

Pseudo-code for the first algorithm, dubbed the "pairs" approach, is shown in Figure 3.8. As usual, document ids and the corresponding contents make up the input key-value pairs. The mapper processes each input document and emits intermediate key-value pairs with each co-occurring word pair as the key and the integer one (i.e., the count) as the value. This is straightforwardly accomplished by two nested loops: the outer loop iterates over all words (the left element in the pair), and the inner loop iterates over all neighbors of the first word (the right element in the pair). The neighbors of a word can either be defined in terms of a sliding window or some other contextual unit such as a sentence. The MapReduce execution framework guarantees that all values associated with the same key are brought together in the reducer. Thus, in this case the reducer simply sums up all the values associated with the same co-occurring word pair to arrive at the absolute count of the joint event in the corpus, which is then emitted as the final key-value pair. Each pair corresponds to a cell in the word co-occurrence matrix. This algorithm illustrates the use of complex keys in order to coordinate distributed computations.

An alternative approach, dubbed the "stripes" approach, is shown in Figure 3.9. Like the pairs approach, co-occurring word pairs are generated by two nested loops. However, the major difference is that instead of emitting intermediate key-value pairs for each co-occurring word pair, co-occurrence information is first stored in an associative array, denoted H. The mapper emits key-value pairs with words as keys and corresponding associative arrays as values, where each associative array encodes the co-occurrence counts of the neighbors of a particular word (i.e., its context). The MapReduce execution framework guarantees that all associative arrays with the same key will be brought together in the reduce phase of processing. The reducer performs an element-wise sum of all associative arrays with the same key, accumulating counts that correspond to the same cell in the co-occurrence matrix. The final associative array is emitted with the same word as the key. In contrast to the pairs approach, each final key-value pair encodes a row in the co-occurrence matrix.

It is immediately obvious that the pairs algorithm generates an immense number of key-value pairs compared to the stripes approach. The stripes representation is much more compact, since with pairs the left element is repeated for every co-occurring word pair. The stripes approach also generates fewer and shorter intermediate keys, and therefore the execution framework has less sorting to perform. However, values in the stripes approach are more complex, and come with more serialization and deserialization overhead than with the pairs approach.

. H3 . 1: 2: 3: 4: 5: 6: 7: 1: 2: 3: 4: 5: 6: class Mapper method Map(docid a. H2 .9: Pseudo-code for the “stripes” approach for computing word co-occurrence matrices from large corpora. c2 . . . H2 . .3. . . . Stripe H) Tally words co-occurring with w class Reducer method Reduce(term w. . . PAIRS AND STRIPES 53 1: 2: 3: 4: 5: 1: 2: 3: 4: 5: 6: class Mapper method Map(docid a. doc d) for all term w ∈ doc d do for all term u ∈ Neighbors(w) do Emit(pair (w.]) Hf ← new AssociativeArray for all stripe H ∈ stripes [H1 .]) s←0 for all count c ∈ counts [c1 . c2 . count s) Sum co-occurrence counts Figure 3. u). count 1) Emit count for each co-occurrence class Reducer method Reduce(pair p. . .2.] do s←s+c Emit(pair p. stripes [H1 . doc d) for all term w ∈ doc d do H ← new AssociativeArray for all term u ∈ Neighbors(w) do H{u} ← H{u} + 1 Emit(Term w. counts [c1 . . H3 .8: Pseudo-code for the “pairs” approach for computing word co-occurrence matrices from large corpora. H) Emit(term w. .] do Sum(Hf . stripe Hf ) Element-wise sum Figure 3.

Both algorithms can benefit from the use of combiners, since the respective operations in their reducers (addition and element-wise sum of associative arrays) are both commutative and associative. However, combiners with the stripes approach have more opportunities to perform local aggregation because the key space is the vocabulary—associative arrays can be merged whenever a word is encountered multiple times by a mapper. In contrast, the key space in the pairs approach is the cross of the vocabulary with itself, which is far larger—counts can be aggregated only when the same co-occurring word pair is observed multiple times by an individual mapper (which is less likely than observing multiple occurrences of a word, as in the stripes case).

For both algorithms, the in-mapper combining optimization discussed in the previous section can also be applied; the modification is sufficiently straightforward that we leave the implementation as an exercise for the reader. However, the above caveats remain: there will be far fewer opportunities for partial aggregation in the pairs approach due to the sparsity of the intermediate key space. The sparsity of the key space also limits the effectiveness of in-memory combining, since the mapper may run out of memory to store partial counts before all documents are processed, necessitating some mechanism to periodically emit key-value pairs (which further limits opportunities to perform partial aggregation). Similarly, for the stripes approach, memory management will also be more complex than in the simple word count example. For common terms, the associative array may grow to be quite large, necessitating some mechanism to periodically flush in-memory structures.

It is important to consider potential scalability bottlenecks of either algorithm. The stripes approach makes the assumption that, at any point in time, each associative array is small enough to fit into memory—otherwise, memory paging will significantly impact performance. The size of the associative array is bounded by the vocabulary size, which is itself unbounded with respect to corpus size (recall the previous discussion of Heaps' Law). Therefore, as the sizes of corpora increase, this will become an increasingly pressing issue—perhaps not for gigabyte-sized corpora, but certainly for terabyte-sized and petabyte-sized corpora that will be commonplace tomorrow. The pairs approach, on the other hand, does not suffer from this limitation, since it does not need to hold intermediate data in memory.

Given this discussion, which approach is faster? Here, we present previously-published results [94] that empirically answered this question. We have implemented both algorithms in Hadoop and applied them to a corpus of 2.27 million documents from the Associated Press Worldstream (APW) totaling 5.7 GB.8 Prior to working with Hadoop, the corpus was first preprocessed as follows: All XML markup was removed, followed by tokenization and stopword removal using standard tools from the Lucene search engine.

8 This was a subset of the English Gigaword corpus (version 3) distributed by the Linguistic Data Consortium (LDC catalog number LDC2007T07).

All tokens were then replaced with unique integers for a more efficient encoding. Figure 3.10 compares the running time of the pairs and stripes approach on different fractions of the corpus, with a co-occurrence window size of two. These experiments were performed on a Hadoop cluster with 19 slave nodes, each with two single-core processors and two disks.

Results demonstrate that the stripes approach is much faster than the pairs approach: 666 seconds (∼11 minutes) compared to 3758 seconds (∼62 minutes) for the entire corpus (improvement by a factor of 5.7). The mappers in the pairs approach generated 2.6 billion intermediate key-value pairs totaling 31.2 GB. After the combiners, this was reduced to 1.1 billion key-value pairs, which quantifies the amount of intermediate data transferred across the network. In the end, the reducers emitted a total of 142 million final key-value pairs (the number of non-zero cells in the co-occurrence matrix). On the other hand, the mappers in the stripes approach generated 653 million intermediate key-value pairs totaling 48.1 GB. After the combiners, only 28.8 million key-value pairs remained. The reducers emitted a total of 1.69 million final key-value pairs (the number of rows in the co-occurrence matrix). As expected, the stripes approach provided more opportunities for combiners to aggregate intermediate results, thus greatly reducing network traffic in the shuffle and sort phase. Figure 3.10 also shows that both algorithms exhibit highly desirable scaling characteristics—linear in the amount of input data. This is confirmed by a linear regression applied to the running time data, which yields an R2 value close to one.

An additional series of experiments explored the scalability of the stripes approach along another dimension: the size of the cluster. These experiments were made possible by Amazon's EC2 service, which allows users to rapidly provision clusters of varying sizes for limited durations (for more information, refer back to our discussion of utility computing in Section 1.1). Virtualized computational units in EC2 are called instances, and the user is charged only for the instance-hours consumed. Figure 3.11 (left) shows the running time of the stripes algorithm (on the same corpus, with same setup as before), on varying cluster sizes, from 20 slave "small" instances all the way up to 80 slave "small" instances (along the x-axis). Running times are shown with solid squares. Figure 3.11 (right) recasts the same results to illustrate scaling characteristics. The circles plot the relative size and speedup of the EC2 experiments, with respect to the 20-instance cluster. These results show highly desirable linear scaling characteristics (i.e., doubling the cluster size makes the job twice as fast). This is confirmed by a linear regression with an R2 value close to one.

Viewed abstractly, the pairs and stripes algorithms represent two different approaches to counting co-occurring events from a large number of observations. This general description captures the gist of many algorithms in fields as diverse as text processing, data mining, and bioinformatics.

Figure 3.10: Running time of the "pairs" and "stripes" algorithms for computing word co-occurrence matrices on different fractions of the APW corpus. These experiments were performed on a Hadoop cluster with 19 slaves, each with two single-core processors and two disks.

Figure 3.11: Running time of the stripes algorithm on the APW corpus with Hadoop clusters of different sizes from EC2 (left). Scaling characteristics (relative speedup) in terms of increasing Hadoop cluster size (right).

For this reason, these two design patterns are broadly useful and frequently observed in a variety of applications.

To conclude, it is worth noting that the pairs and stripes approaches represent endpoints along a continuum of possibilities. The pairs approach individually records each co-occurring event, while the stripes approach records all co-occurring events with respect to a conditioning event. A middle ground might be to record a subset of the co-occurring events with respect to a conditioning event. We might divide up the entire vocabulary into b buckets (e.g., via hashing), so that words co-occurring with wi would be divided into b smaller "sub-stripes", associated with b separate keys, (wi, 1), (wi, 2) . . . (wi, b). This would be a reasonable solution to the memory limitations of the stripes approach, since each of the sub-stripes would be smaller. In the case of b = |V|, where |V| is the vocabulary size, this is equivalent to the pairs approach. In the case of b = 1, this is equivalent to the standard stripes approach.

3.3 COMPUTING RELATIVE FREQUENCIES

Let us build on the pairs and stripes algorithms presented in the previous section and continue with our running example of constructing the word co-occurrence matrix M for a large corpus. Recall that in this large square n × n matrix, where n = |V| (the vocabulary size), cell mij contains the number of times word wi co-occurs with word wj within a specific context. The drawback of absolute counts is that it doesn't take into account the fact that some words appear more frequently than others. Word wi may co-occur frequently with wj simply because one of the words is very common. A simple remedy is to convert absolute counts into relative frequencies, f(wj|wi). That is, what proportion of the time does wj appear in the context of wi? This can be computed using the following equation:

f(wj|wi) = N(wi, wj) / Σw′ N(wi, w′)          (3.1)

Here, N(·, ·) indicates the number of times a particular co-occurring word pair is observed in the corpus. We need the count of the joint event (word co-occurrence), divided by what is known as the marginal (the sum of the counts of the conditioning variable co-occurring with anything else).

Computing relative frequencies with the stripes approach is straightforward. In the reducer, counts of all words that co-occur with the conditioning variable (wi in the above example) are available in the associative array. Therefore, it suffices to sum all those counts to arrive at the marginal (i.e., Σw′ N(wi, w′)), and then divide all the joint counts by the marginal to arrive at the relative frequency for all words. This implementation requires minimal modification to the original stripes algorithm in Figure 3.9, and illustrates the use of complex data structures to coordinate distributed computations in MapReduce.
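The reducer-side computation just described amounts to a sum followed by an element-wise division. A small plain-Java sketch of that step (operating on an in-memory stripe and not tied to the Hadoop API; the class and method names are illustrative only):

import java.util.HashMap;
import java.util.Map;

public class StripeRelativeFrequency {
  // Given the stripe for a conditioning word wi (co-occurring word -> joint count),
  // compute f(wj | wi) = N(wi, wj) / sum over w' of N(wi, w') for every wj in the stripe.
  static Map<String, Double> relativeFrequencies(Map<String, Integer> stripe) {
    long marginal = 0;
    for (int count : stripe.values()) {
      marginal += count;                              // sum of all joint counts
    }
    Map<String, Double> f = new HashMap<>();
    for (Map.Entry<String, Integer> e : stripe.entrySet()) {
      f.put(e.getKey(), e.getValue() / (double) marginal);
    }
    return f;
  }
}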

Note that, as with before, this algorithm also assumes that each associative array fits into memory.

How might one compute relative frequencies with the pairs approach? In the pairs approach, the reducer receives (wi, wj) as the key and the count as the value. From this alone it is not possible to compute f(wj|wi) since we do not have the marginal. Fortunately, as in the mapper, the reducer can preserve state across multiple keys. Inside the reducer, we can buffer in memory all the words that co-occur with wi and their counts, in essence building the associative array in the stripes approach. To make this work, we must define the sort order of the pair so that keys are first sorted by the left word, and then by the right word. Given this ordering, we can easily detect if all pairs associated with the word we are conditioning on (wi) have been encountered. At that point we can go back through the in-memory buffer, compute the relative frequencies, and then emit those results in the final key-value pairs.

There is one more modification necessary to make this algorithm work. We must ensure that all pairs with the same left word are sent to the same reducer. This, unfortunately, does not happen automatically: recall that the default partitioner is based on the hash value of the intermediate key, modulo the number of reducers. For a complex key, the raw byte representation is used to compute the hash value. As a result, there is no guarantee that, for example, (dog, aardvark) and (dog, zebra) are assigned to the same reducer. To produce the desired behavior, we must define a custom partitioner that only pays attention to the left word. That is, the partitioner should partition based on the hash of the left word only.

This algorithm will indeed work, but it suffers from the same drawback as the stripes approach: as the size of the corpus grows, so does the vocabulary size, and at some point there will not be sufficient memory to store all co-occurring words and their counts for the word we are conditioning on. For computing the co-occurrence matrix, the advantage of the pairs approach is that it doesn't suffer from any memory bottlenecks. Is there a way to modify the basic pairs approach so that this advantage is retained?

As it turns out, such an algorithm is indeed possible, although it requires the coordination of several mechanisms in MapReduce. The insight lies in properly sequencing data presented to the reducer. Through appropriate structuring of keys and values, one can use the MapReduce execution framework to bring together all the pieces of data required to perform a computation. If it were possible to somehow compute (or otherwise obtain access to) the marginal in the reducer before processing the joint counts, the reducer could simply divide the joint counts by the marginal to compute the relative frequencies. The notion of "before" and "after" can be captured in the ordering of key-value pairs, which can be explicitly controlled by the programmer. That is, the programmer can define the sort order of keys so that data needed earlier is presented to the reducer before data that is needed later.

To compute relative frequencies with the pairs approach, we make two modifications to the basic algorithm. First, the mapper is modified so that, in addition to each co-occurring word pair, it additionally emits a "special" key of the form (wi, ∗), with a value of one, that represents the contribution of the word pair to the marginal. Through the use of combiners, these partial marginal counts will be aggregated before being sent to the reducers; alternatively, the in-mapper combining pattern can be used to aggregate marginal counts even more efficiently. Second, as before, we must properly define the partitioner to pay attention to only the left word in each pair, and we must make sure that the special key-value pairs representing the partial marginal contributions are processed before the normal key-value pairs representing the joint counts. This is accomplished by defining the sort order of the keys so that pairs with the special symbol of the form (wi, ∗) are ordered before any other key-value pairs where the left word is wi.

A concrete example is shown in Figure 3.12, which lists the sequence of key-value pairs that a reducer might encounter.

Figure 3.12: Example of the sequence of key-value pairs presented to the reducer in the pairs algorithm for computing relative frequencies.

    key              values
    (dog, ∗)         [6327, 8514, . . .]   compute marginal: Σw' N(dog, w') = 42908
    (dog, aardvark)  [2, 1]                f(aardvark|dog) = 3/42908
    (dog, aardwolf)  [1]                   f(aardwolf|dog) = 1/42908
    . . .
    (dog, zebra)     [2, 1, 1, 1]          f(zebra|dog) = 5/42908
    (doge, ∗)        [682, . . .]          compute marginal: Σw' N(doge, w') = 1267
    . . .

First, the reducer is presented with the special key (dog, ∗) and a number of values, each of which represents a partial marginal contribution from the map phase (assume here either combiners or in-mapper combining, so the values represent partially aggregated counts). The reducer accumulates these counts to arrive at the marginal, Σw' N(dog, w'), and holds on to this value as it processes subsequent keys. After (dog, ∗), the reducer will encounter a series of keys representing joint counts; let's say the first of these is the key (dog, aardvark). Associated with this key will be a list of values representing partial joint counts from the map phase (two separate values in this case). Summing these counts yields the final joint count, i.e., the number of times dog and aardvark co-occur in the entire collection. With the data properly sequenced in this manner, the reducer can directly compute the relative frequencies. This illustrates the application of the order inversion design pattern.
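The reducer logic for this pattern is compact; the following is a non-authoritative Python sketch (framework-independent, with made-up partial counts chosen to match Figure 3.12) in which keys arrive already sorted so that each (wi, "*") key precedes the joint-count keys for wi, as order inversion requires:

    def relative_frequency_reducer(sorted_key_value_stream):
        # Yield ((w_i, w_j), f(w_j | w_i)), assuming keys sorted with (w_i, '*') first.
        marginal = 0
        for (left, right), values in sorted_key_value_stream:
            if right == "*":                # special key: accumulate the marginal for `left`
                marginal = sum(values)      # partial marginal contributions
            else:                           # ordinary key: joint count for (left, right)
                yield (left, right), sum(values) / marginal

    stream = [(("dog", "*"), [6327, 8514, 28067]),
              (("dog", "aardvark"), [2, 1]),
              (("dog", "zebra"), [2, 1, 1, 1])]
    print(dict(relative_frequency_reducer(stream)))   # 3/42908 and 5/42908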

At this point, since the reducer already knows the marginal, simple arithmetic suffices to compute the relative frequency: in Figure 3.12, f(aardvark|dog) = 3/42908. All subsequent joint counts are processed in exactly the same manner. When the reducer encounters the next special key-value pair, (doge, ∗), the reducer resets its internal state and starts to accumulate the marginal all over again. Observe that the memory requirement for this algorithm is minimal, since only the marginal (an integer) needs to be stored. No buffering of individual co-occurring word counts is necessary; this greatly cuts down on the amount of partial results that the reducer needs to hold in memory, and therefore we have eliminated the scalability bottleneck of the previous algorithm.

This design pattern, which we call "order inversion", occurs surprisingly often and across applications in many domains. It is so named because through proper coordination, we can access the result of a computation in the reducer (for example, an aggregate statistic) before processing the data needed for that computation. The key insight is to convert the sequencing of computations into a sorting problem. In most cases, an algorithm requires data in some fixed order: by controlling how keys are sorted and how the key space is partitioned, we can present data to the reducer in the order necessary to perform the proper computations. As we will see in Chapter 4, this design pattern is also used in inverted index construction to properly set compression parameters for postings lists.

To summarize, the specific application of the order inversion design pattern for computing relative frequencies requires the following:

• Emitting a special key-value pair for each co-occurring word pair in the mapper to capture its contribution to the marginal.

• Controlling the sort order of the intermediate key so that the key-value pairs representing the marginal contributions are processed by the reducer before any of the pairs representing the joint word co-occurrence counts.

• Defining a custom partitioner to ensure that all pairs with the same left word are shuffled to the same reducer.

• Preserving state across multiple keys in the reducer to first compute the marginal based on the special key-value pairs and then dividing the joint counts by the marginals to arrive at the relative frequencies.

3.4 SECONDARY SORTING

MapReduce sorts intermediate key-value pairs by the keys during the shuffle and sort phase, which is very convenient if computations inside the reducer rely on sort order (e.g., the order inversion design pattern described in the previous section). However, what if in addition to sorting by key, we also need to sort by value?

Google's MapReduce implementation provides built-in functionality for (optional) secondary sorting, which guarantees that values arrive in sorted order. Hadoop, unfortunately, does not have this capability built in. This is a common problem, since in many applications we wish to first group together data one way (e.g., by sensor id), and then sort within the groupings another way (e.g., by time).

Consider the example of sensor data from a scientific experiment: there are m sensors, each taking readings on a continuous basis, where m is potentially a large number. A dump of the sensor data might look something like the following, where rx after each timestamp represents the actual sensor readings (unimportant for this discussion, but may be a series of values, one or more complex records, or even raw bytes of images):

    (t1, m1, r80521)
    (t1, m2, r14209)
    (t1, m3, r76042)
    . . .
    (t2, m1, r21823)
    (t2, m2, r66508)
    (t2, m3, r98347)

Suppose we wish to reconstruct the activity at each individual sensor over time. A MapReduce program to accomplish this might map over the raw data and emit the sensor id as the intermediate key, with the rest of each record as the value:

    m1 → (t1, r80521)

This would bring all readings from the same sensor together in the reducer. However, since MapReduce makes no guarantees about the ordering of values associated with the same key, the sensor readings will not likely be in temporal order. The most obvious solution is to buffer all the readings in memory and then sort by timestamp before additional processing. However, it should be apparent by now that any in-memory buffering of data introduces a potential scalability bottleneck. What if we are working with a high-frequency sensor, or sensor readings over a long period of time? What if the sensor readings themselves are large complex objects? This approach may not scale in these cases—the reducer would run out of memory trying to buffer all values associated with the same key.

Fortunately, there is a general purpose solution, which we call the "value-to-key conversion" design pattern. The basic idea is to move part of the value into the intermediate key to form a composite key, and let the MapReduce execution framework handle the sorting. In the above example, instead of emitting the sensor id as the key, we would emit the sensor id and the timestamp as a composite key:

    (m1, t1) → (r80521)
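The following Python sketch simulates what the execution framework would do under value-to-key conversion; the sensor ids, timestamps, and readings are made up for illustration, and in Hadoop the sorting and partitioning would be supplied via a custom sort comparator and partitioner rather than done by hand:

    # Raw records: (timestamp, sensor_id, reading)
    records = [(2, "m1", "r21823"), (1, "m2", "r14209"), (1, "m1", "r80521")]

    # Map: move the timestamp from the value into a composite key (sensor_id, timestamp).
    pairs = [((sensor, t), reading) for (t, sensor, reading) in records]

    # Simulated shuffle: in a real job, partition by sensor_id only and sort by the composite key.
    pairs.sort(key=lambda kv: kv[0])

    for (sensor, t), reading in pairs:      # reducer sees each sensor's readings in time order
        print(sensor, t, reading)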

The sensor reading itself now occupies the value. We must define the intermediate key sort order to first sort by the sensor id (the left element in the pair) and then by the timestamp (the right element in the pair). We must also implement a custom partitioner so that all pairs associated with the same sensor are shuffled to the same reducer. Properly orchestrated, the key-value pairs will be presented to the reducer in the correct sorted order:

    (m1, t1) → [(r80521)]
    (m1, t2) → [(r21823)]
    (m1, t3) → [(r146925)]
    . . .

However, note that sensor readings are now split across multiple keys. The reducer will need to preserve state and keep track of when readings associated with the current sensor end and the next sensor begin. (Hadoop provides API hooks to define "groups" of intermediate keys that should be processed together in the reducer.)

The basic tradeoff between the two approaches discussed above (buffer and in-memory sort vs. value-to-key conversion) is where sorting is performed. One can explicitly implement secondary sorting in the reducer, which is likely to be faster but suffers from a scalability bottleneck. (In principle, this need not be an in-memory sort: it is entirely possible to implement a disk-based sort within the reducer, although one would be duplicating functionality that is already present in the MapReduce execution framework.) With value-to-key conversion, sorting is offloaded to the MapReduce execution framework. This pattern results in many more keys for the framework to sort, but distributed sorting is a task that the MapReduce runtime excels at, since it lies at the heart of the programming model. It therefore makes more sense to take advantage of functionality that is already present with value-to-key conversion, which provides a scalable solution for secondary sorting. Note that this approach can be arbitrarily extended to tertiary, quaternary, etc. sorting.

3.5 RELATIONAL JOINS

One popular application of Hadoop is data warehousing. In an enterprise setting, a data warehouse serves as a vast repository of data, holding everything from sales transactions to product inventories. Typically, the data is relational in nature, but increasingly data warehouses are used to store semi-structured data (e.g., query logs) as well as unstructured data. Data warehouses form a foundation for business intelligence applications designed to provide decision support. It is widely believed that insights gained by mining historical, current, and prospective data can yield competitive advantages in the marketplace. Traditionally, data warehouses have been implemented through relational databases, particularly those optimized for a specific workload known as online analytical processing (OLAP).

A number of vendors offer parallel databases, but customers find that they often cannot cost-effectively scale to the crushing amounts of data an organization needs to deal with today. Parallel databases are often quite expensive—on the order of tens of thousands of dollars per terabyte of user data. Over the past few years, Hadoop has gained popularity as a platform for data warehousing. Hammerbacher [68], for example, discussed Facebook's experiences with scaling up business intelligence applications with Oracle databases, which they ultimately abandoned in favor of a Hadoop-based solution developed in-house called Hive (which is now an open-source project). Pig [114] is a platform for massive data analytics built on Hadoop and capable of handling structured as well as semi-structured data; it was originally developed by Yahoo, but is now also an open-source project.

There is an ongoing debate between advocates of parallel databases and proponents of MapReduce regarding the merits of both approaches for OLAP-type workloads. Dewitt and Stonebraker, two well-known figures in the database community, famously decried MapReduce as "a major step backwards" in a controversial blog post (http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/). With colleagues, they ran a series of benchmarks that demonstrated the supposed superiority of column-oriented parallel databases over Hadoop [120, 144]; see Dean and Ghemawat's counterarguments [47] and recent attempts at hybrid architectures [1]. We shall refrain here from participating in this lively debate, and instead focus on discussing algorithms.

We should stress here that even though Hadoop has been applied to process relational data, Hadoop is not a database. From an application point of view, it is highly unlikely that an analyst interacting with a data warehouse will ever be called upon to write MapReduce programs (and indeed, Hadoop-based systems such as Hive and Pig present a much higher-level language for interacting with large amounts of data). Nevertheless, given successful applications of Hadoop to data warehousing and the complex analytical queries that are prevalent in such an environment, it makes sense to examine MapReduce algorithms for manipulating relational data, and it is instructive to understand the algorithms that underlie basic relational operations.

This section focuses specifically on performing relational joins in MapReduce, and presents three different strategies for performing relational joins on two datasets (relations), generically named S and T. Let us suppose that relation S looks something like the following:

    (k1, s1, S1)
    (k2, s2, S2)
    (k3, s3, S3)
    . . .

where k is the key we would like to join on, sn is a unique id for the tuple, and the Sn after sn denotes other attributes in the tuple (unimportant for the purposes of the join). Similarly, suppose relation T looks something like this:

    (k1, t1, T1)
    (k3, t2, T2)
    (k8, t3, T3)
    . . .

where k is the join key, tn is a unique id for the tuple, and the Tn after tn denotes other attributes in the tuple.

To make this task more concrete, we present one realistic scenario: S might represent a collection of user profiles, in which case k could be interpreted as the primary key (i.e., user id). The tuples might contain demographic information such as age, gender, income, etc. The other dataset, T, might represent logs of online activity. Each tuple might correspond to a page view of a particular URL and may contain additional information such as time spent on the page, ad revenue generated, etc. The k in these tuples could be interpreted as the foreign key that associates each individual page view with a user. Joining these two datasets would allow an analyst, for example, to break down online activity in terms of demographics.

3.5.1 REDUCE-SIDE JOIN

The first approach to relational joins is what's known as a reduce-side join. The idea is quite simple: we map over both datasets and emit the join key as the intermediate key, and the tuple itself as the intermediate value. Since MapReduce guarantees that all values with the same key are brought together, all tuples will be grouped by the join key—which is exactly what we need to perform the join operation. This approach is known as a parallel sort-merge join in the database community [134]. In more detail, there are three different cases to consider.

The first and simplest is a one-to-one join, where at most one tuple from S and one tuple from T share the same join key (but it may be the case that no tuple from S shares the join key with a tuple from T, or vice versa). In this case, the algorithm sketched above will work fine. The reducer will be presented with keys and lists of values along the lines of the following:

    k23 → [(s64, S64), (t84, T84)]
    k37 → [(s68, S68)]
    k59 → [(t97, T97), (s81, S81)]
    k61 → [(t99, T99)]
    . . .

Since we've emitted the join key as the intermediate key, we can remove it from the value to save a bit of space (not very important if the intermediate data is compressed). If there are two values associated with a key, then we know that one must be from S and the other must be from T. However, recall that in the

basic MapReduce programming model, no guarantees are made about value ordering, so the first value might be from S or from T. If there is only one value associated with a key, this means that no tuple in the other dataset shares the join key, so the reducer does nothing. If there are two values, we can proceed to join the two tuples and perform additional computations (e.g., filter by some other attribute, compute aggregates, etc.).

Let us now consider the one-to-many join. Assume that tuples in S have unique join keys (i.e., k is the primary key in S), so that S is the "one" and T is the "many". The above algorithm will still work, but when processing each key in the reducer, we have no idea when the value corresponding to the tuple from S will be encountered, since values are arbitrarily ordered. The easiest solution is to buffer all values in memory, pick out the tuple from S, and then cross it with every tuple from T to perform the join. However, as we have seen several times already, this creates a scalability bottleneck since we may not have sufficient memory to hold all the tuples with the same join key.

This is a problem that requires a secondary sort, and the solution lies in the value-to-key conversion design pattern we just presented. In the mapper, instead of simply emitting the join key as the intermediate key, we instead create a composite key consisting of the join key and the tuple id (from either S or T). Two additional changes are required: First, we must define the sort order of the keys to first sort by the join key, and then sort all tuple ids from S before all tuple ids from T. Second, we must define the partitioner to pay attention to only the join key, so that all composite keys with the same join key arrive at the same reducer. Since both the join key and the tuple id are present in the intermediate key, we can remove them from the value to save a bit of space (once again, not very important if the intermediate data is compressed).

Whenever the reducer encounters a new join key, it is guaranteed that the associated value will be the relevant tuple from S. The reducer can hold this tuple in memory and then proceed to cross it with tuples from T in subsequent steps (until a new join key is encountered). Since the MapReduce execution framework performs the sorting, there is no need to buffer tuples (other than the single one from S); thus, we have eliminated the scalability bottleneck. A minimal sketch of this one-to-many reduce-side join follows below.
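The following Python sketch simulates the one-to-many reduce-side join with value-to-key conversion; the relation contents, the tags used to order S before T, and the in-memory simulation of the shuffle are illustrative assumptions, not part of any MapReduce API:

    # Tuples: S is the "one" side, T the "many" side; the first field is the join key.
    S = [("k1", "s1", {"age": 30}), ("k2", "s2", {"age": 41})]
    T = [("k1", "t1", {"url": "/a"}), ("k1", "t2", {"url": "/b"}), ("k2", "t3", {"url": "/c"})]

    # Map: composite key (join_key, relation_tag, tuple_id); tag "0" sorts S before T ("1").
    pairs = [((k, "0", sid), attrs) for (k, sid, attrs) in S]
    pairs += [((k, "1", tid), attrs) for (k, tid, attrs) in T]
    pairs.sort(key=lambda kv: kv[0])        # framework sort; partitioning would use the join key only

    # Reduce: the S tuple arrives first for each join key and is held; T tuples stream past it.
    current_key, s_attrs = None, None
    for (k, tag, _tid), attrs in pairs:
        if tag == "0":
            current_key, s_attrs = k, attrs       # remember the single tuple from S
        elif k == current_key:
            print(k, {**s_attrs, **attrs})        # join: S attributes merged with each T tuple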

Finally, let us consider the many-to-many join case. Assuming that S is the smaller dataset, the above algorithm works as well. After applying the value-to-key conversion design pattern, the reducer will be presented with keys and values along the lines of the following:

    (k82, s105) → [(S105)]
    (k82, s124) → [(S124)]
    . . .
    (k82, t98) → [(T98)]
    (k82, t101) → [(T101)]
    (k82, t137) → [(T137)]
    . . .

Consider what happens at the reducer. All the tuples from S with the same join key will be encountered first, which the reducer can buffer in memory. As the reducer processes each tuple from T, it is crossed with all the tuples from S. Of course, we are assuming that the tuples from S (with the same join key) will fit into memory, which is a limitation of this algorithm (and why we want to control the sort order so that the smaller dataset comes first).

The basic idea behind the reduce-side join is to repartition the two datasets by the join key. The approach isn't particularly efficient, since it requires shuffling both datasets across the network. This leads us to the map-side join.

3.5.2 MAP-SIDE JOIN

Suppose we have two datasets that are both sorted by the join key. We can perform a join by scanning through both datasets simultaneously—this is known as a merge join in the database community. We can parallelize this by partitioning and sorting both datasets in the same way. For example, suppose S and T were both divided into ten files, partitioned in the same manner by the join key, and further suppose that in each file the tuples were sorted by the join key. In this case, we simply need to merge join the first file of S with the first file of T, the second file of S with the second file of T, etc. This can be accomplished in parallel, in the map phase of a MapReduce job—hence, a map-side join: we map over one of the datasets (the larger one) and inside the mapper read the corresponding part of the other dataset to perform the merge join (note that this almost always implies a non-local read). No reducer is required, unless the programmer wishes to repartition the output or perform further processing.

A map-side join is far more efficient than a reduce-side join since there is no need to shuffle the datasets over the network. But is it realistic to expect that the stringent conditions required for map-side joins are satisfied? In many cases, yes. The reason is that relational joins happen within the broader context of a workflow, which may include multiple steps. Therefore, the datasets that are to be joined may be the output of previous processes (either MapReduce jobs or other code). If the workflow is known in advance and relatively static (both reasonable assumptions in a mature workflow), we can engineer the previous processes to generate output sorted and partitioned in a way that makes efficient map-side joins possible (in MapReduce, by using a custom partitioner and controlling the sort order of key-value pairs).

For ad hoc data analysis, reduce-side joins are a more general, albeit less efficient, solution. Consider the case where datasets have multiple keys that one might wish to join on—then no matter how the data is organized, map-side joins will require repartitioning of the data. Of course, it is always possible to repartition a dataset using an identity mapper and reducer, but this incurs the cost of shuffling data over the network.

There is a final restriction to bear in mind when using map-side joins with the Hadoop implementation of MapReduce. We assume here that the datasets to be joined were produced by previous MapReduce jobs, so this restriction applies to keys the reducers in those jobs may emit. Hadoop permits reducers to emit keys that are different from the input key whose values they are processing (that is, input and output keys need not be the same, nor even the same type). (In contrast, recall from Section 2.2 that in Google's implementation, reducers' output keys must be exactly the same as their input keys.) However, if the output key of a reducer is different from the input key, then the output dataset from the reducer will not necessarily be partitioned in a manner consistent with the specified partitioner (because the partitioner applies to the input keys rather than the output keys). Since map-side joins depend on consistent partitioning and sorting of keys, the reducers used to generate data that will participate in a later map-side join must not emit any key but the one they are currently processing.

3.5.3 MEMORY-BACKED JOIN

In addition to the two previous approaches to joining relational data that leverage the MapReduce framework to bring together tuples that share a common join key, there is a family of approaches we call memory-backed joins, based on random access probes. The simplest version is applicable when one of the two datasets completely fits in memory on each node. In this situation, we can load the smaller dataset into memory in every mapper, populating an associative array to facilitate random access to tuples based on the join key; the mapper initialization API hook (see Section 3.1.1) can be used for this purpose. Mappers are then applied to the other (larger) dataset, and for each input key-value pair, the mapper probes the in-memory dataset to see if there is a tuple with the same join key. If there is, the join is performed. This is known as a simple hash join in the database community [51]; a brief sketch follows below.

What if neither dataset fits in memory? The simplest solution is to divide the smaller dataset, let's say S, into n partitions, such that S = S1 ∪ S2 ∪ . . . ∪ Sn. We can choose n so that each partition is small enough to fit in memory, and then run n memory-backed hash joins. This, of course, requires streaming through the other dataset n times.
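Here is a minimal Python sketch of the per-mapper logic for a memory-backed (simple hash) join; the relations and field layout are illustrative assumptions, and in Hadoop the smaller relation would be loaded in the mapper's initialization hook rather than from a local list as shown:

    # Smaller relation S, loaded once per mapper into an associative array keyed by join key.
    S = [("k1", "s1", {"age": 30}), ("k2", "s2", {"age": 41})]
    s_by_key = {k: attrs for (k, _sid, attrs) in S}

    def map_record(t_record):
        # Probe the in-memory copy of S for each record of the larger relation T.
        k, _tid, t_attrs = t_record
        if k in s_by_key:                       # join key present in S?
            return k, {**s_by_key[k], **t_attrs}
        return None                             # no match: emit nothing

    print(map_record(("k1", "t9", {"url": "/a"})))   # ('k1', {'age': 30, 'url': '/a'})
    print(map_record(("k7", "t10", {"url": "/b"})))  # None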

There is an alternative approach to memory-backed joins for cases where neither dataset fits into memory. A distributed key-value store can be used to hold one dataset in memory across multiple machines while mapping over the other; the mappers would then query this distributed key-value store in parallel and perform joins if the join keys match. (In order to achieve good performance in accessing distributed key-value stores, it is often necessary to batch queries before making synchronous requests, to amortize latency over many requests, or to rely on asynchronous requests.) The open-source caching system memcached can be used for exactly this purpose, and therefore we've dubbed this approach memcached join. For more information, this approach is detailed in a technical report [95].

3.6 SUMMARY

This chapter provides a guide on the design of MapReduce algorithms. In particular, we present a number of "design patterns" that capture effective solutions to common problems. In summary, they are:

• "In-mapper combining", where the functionality of the combiner is moved into the mapper. Instead of emitting intermediate output for every input key-value pair, the mapper aggregates partial results across multiple input records and only emits intermediate key-value pairs after some amount of local aggregation is performed.

• The related patterns "pairs" and "stripes" for keeping track of joint events from a large number of observations. In the pairs approach, we keep track of each joint event separately, whereas in the stripes approach we keep track of all events that co-occur with the same event. Although the stripes approach is significantly more efficient, it requires memory on the order of the size of the event space, which presents a scalability bottleneck.

• "Order inversion", where the main idea is to convert the sequencing of computations into a sorting problem. Through careful orchestration, we can send the reducer the result of a computation (e.g., an aggregate statistic) before it encounters the data necessary to produce that computation.

• "Value-to-key conversion", which provides a scalable solution for secondary sorting. By moving part of the value into the key, we can exploit the MapReduce execution framework itself for sorting.

Ultimately, controlling synchronization in the MapReduce programming model boils down to effective use of the following techniques:

1. Constructing complex keys and values that bring together data necessary for a computation. This is used in all of the above design patterns.

2. Executing user-specified initialization and termination code in either the mapper or reducer. For example, in-mapper combining depends on emission of intermediate key-value pairs in the map task termination code.

3. Preserving state across multiple inputs in the mapper and reducer. This is used in in-mapper combining, order inversion, and value-to-key conversion.

4. Controlling the sort order of intermediate keys. This is used in order inversion and value-to-key conversion.

5. Controlling the partitioning of the intermediate key space. This is used in order inversion and value-to-key conversion.

This concludes our overview of MapReduce algorithm design. It should be clear by now that although the programming model forces one to express algorithms in terms of a small set of rigidly-defined components, there are many tools at one's disposal to shape the flow of computation. In the next few chapters, we will focus on specific classes of MapReduce algorithms: for inverted indexing in Chapter 4, for graph processing in Chapter 5, and for expectation-maximization in Chapter 6.

CHAPTER 4

Inverted Indexing for Text Retrieval

Web search is the quintessential large-data problem. Given an information need expressed as a short query consisting of a few terms, the system's task is to retrieve relevant web objects (web pages, PDF documents, PowerPoint slides, etc.) and present them to the user. How large is the web? It is difficult to compute exactly, but even a conservative estimate would place the size at several tens of billions of pages, totaling hundreds of terabytes (considering text alone). In information retrieval parlance, objects to be retrieved are generically called "documents" even though in actuality they may be web pages, PDFs, or even fragments of code.

The web search problem decomposes into three components: gathering web content (crawling), construction of the inverted index (indexing), and ranking documents given a query (retrieval). Crawling and indexing share similar characteristics and requirements, but these are very different from retrieval. Gathering web content and building inverted indexes are for the most part offline problems. Both need to be scalable and efficient, but they do not need to operate in real time. Indexing is usually a batch process that runs periodically: the frequency of refreshes and updates is usually dependent on the design of the crawler. Some sites (e.g., news organizations) update their content quite frequently and need to be visited often; other sites (e.g., government regulations) are relatively static. However, even for rapidly changing sites, it is usually tolerable to have a delay of a few minutes until content is searchable. Furthermore, since the amount of content that changes rapidly is relatively small, running smaller-scale index updates at greater frequencies is usually an adequate solution. (We leave aside the problem of searching live data streams such as tweets, which requires different techniques and algorithms.)

Retrieval, on the other hand, is an online problem that demands sub-second response time. Individual users expect low query latencies—query latencies longer than a few hundred milliseconds will try a user's patience—but query throughput is equally important, since a retrieval engine must usually serve many users concurrently. Furthermore, query loads are highly variable, depending on the time of day, and can exhibit "spikey" behavior due to special circumstances (e.g., a breaking news event triggers a large number of searches on the same topic). By contrast, resource consumption for the indexing problem is more predictable. Fulfilling these requirements is quite an engineering feat, considering the amounts of data involved!

Nearly all retrieval engines for full-text search today rely on a data structure called an inverted index, which given a term provides access to the list of documents that contain the term. Given a user query, the retrieval engine uses the inverted index to score documents that contain the query terms with respect to some ranking model, taking into account features such as term matches, term proximity, attributes of the terms in the document (e.g., bold, appears in title, etc.), as well as the hyperlink structure of the documents (e.g., PageRank [117], or related metrics such as HITS [84] and SALSA [88]).

A comprehensive treatment of web search is beyond the scope of this chapter, and even this entire book. Since MapReduce is primarily designed for batch-oriented processing, it does not provide an adequate solution for the retrieval problem, an issue we discuss in Section 4.6. Explicitly recognizing this, we mostly focus on the problem of inverted indexing, the task most amenable to solutions in MapReduce. This chapter begins by first providing an overview of web crawling (Section 4.1) and introducing the basic structure of an inverted index (Section 4.2). A baseline inverted indexing algorithm in MapReduce is presented in Section 4.3. We point out a scalability bottleneck in that algorithm, which leads to a revised version presented in Section 4.4. Index compression is discussed in Section 4.5, which fills in missing details on building compact index structures. The chapter concludes with a summary and pointers to additional readings.

4.1 WEB CRAWLING

Before building inverted indexes, we must first acquire the document collection over which these indexes are to be built. In academia and for research purposes, this can be relatively straightforward. Standard collections for information retrieval research are widely available for a variety of genres ranging from blogs to newswire text. For researchers who wish to explore web-scale retrieval, there is the ClueWeb09 collection that contains one billion web pages in ten languages (totaling 25 terabytes) crawled by Carnegie Mellon University in early 2009 (http://boston.lti.cs.cmu.edu/Data/clueweb09/). Obtaining access to these standard collections is usually as simple as signing an appropriate data license from the distributor of the collection, paying a reasonable fee, and arranging for receipt of the data. (As an interesting side note, in the 1990s research collections were distributed via postal mail on CD-ROMs, and later, on DVDs. Electronic distribution became common earlier this decade for collections below a certain size. However, many collections today are so large that the only practical method of distribution is shipping hard drives via postal mail.)

For real-world web search, however, one cannot simply assume that the collection is already available. Acquiring web content requires crawling, which is the process of traversing the web by repeatedly following hyperlinks and storing downloaded pages for subsequent processing. Conceptually, the process is quite simple to understand: we start by populating a queue with a "seed" list of pages. The crawler downloads pages in the queue, extracts links from those pages to add to the queue, stores the pages for further processing, and repeats. In fact, rudimentary web crawlers can be written in a few hundred lines of code; a toy sketch of the core loop follows below.
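To make the conceptual loop above concrete, here is a deliberately naive Python sketch of a crawler's queue-driven core (single-threaded, with no politeness delays, robots.txt handling, or deduplication—precisely the machinery the issues listed next require); the seed URL is a placeholder:

    from collections import deque
    from urllib.parse import urljoin
    import re, urllib.request

    def toy_crawl(seed, limit=10):
        queue, seen, pages = deque([seed]), {seed}, {}
        while queue and len(pages) < limit:
            url = queue.popleft()
            try:
                html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue                       # skip pages that fail to download
            pages[url] = html                  # store the page for later processing (e.g., indexing)
            for link in re.findall(r'href="([^"#]+)"', html):
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)     # add newly discovered links to the frontier
        return pages

    # pages = toy_crawl("http://example.com/")   # placeholder seed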

However, effective and efficient web crawling is far more complex. The following lists a number of issues that real-world crawlers must contend with:

• A web crawler must practice good "etiquette" and not overload web servers. For example, it is common practice to wait a fixed amount of time before repeated requests to the same server. In order to respect these constraints while maintaining good throughput, a crawler typically keeps many execution threads running in parallel and maintains many TCP connections (perhaps hundreds) open at the same time.

• Since a crawler has finite bandwidth and resources, it must prioritize the order in which unvisited pages are downloaded. Such decisions must be made online and in an adversarial environment, in the sense that spammers actively create "link farms" and "spider traps" full of spam pages to trick a crawler into overrepresenting content from a particular site.

• Most real-world web crawlers are distributed systems that run on clusters of machines, often geographically distributed. To avoid downloading a page multiple times and to ensure data consistency, the crawler as a whole needs mechanisms for coordination and load balancing. It also needs to be robust with respect to machine failures, network outages, and errors of various types.

• Web content changes, but with different frequency depending on both the site and the nature of the content. A web crawler needs to learn these update patterns to ensure that content is reasonably current. Getting the right recrawl frequency is tricky: too frequent means wasted resources, but not frequent enough leads to stale content.

• The web is full of duplicate content. Examples include multiple copies of a popular conference paper, mirrors of frequently-accessed sites such as Wikipedia, and newswire content that is often duplicated. The problem is compounded by the fact that most repetitious pages are not exact duplicates but near duplicates (that is, basically the same page but with different ads, navigation bars, etc.). It is desirable during the crawling process to identify near duplicates and select the best exemplar to index.

• The web is multilingual. There is no guarantee that pages in one language only link to pages in the same language. For example, a professor in Asia may maintain her website in the local language, but contain links to publications in English. Furthermore, many pages contain a mix of text in different languages. Since document processing techniques (e.g., tokenization, stemming) differ by language, it is important to identify the (dominant) language on a page.

The above discussion is not meant to be an exhaustive enumeration of issues, but rather to give the reader an appreciation of the complexities involved in this intuitively simple task. For more information, see a recent survey on web crawling [113]. Section 4.7 provides pointers to additional readings.

4.2 INVERTED INDEXES

In its basic form, an inverted index consists of postings lists, one associated with each term that appears in the collection. (In information retrieval parlance, term is preferred over word, since documents are processed—e.g., by tokenization and stemming—into basic units that are often not words in the linguistic sense.) The structure of an inverted index is illustrated in Figure 4.1. A postings list is comprised of individual postings, each of which consists of a document id and a payload—information about occurrences of the term in the document. The simplest payload is nothing: for simple boolean retrieval, no additional information is needed in the posting other than the document id, since the existence of the posting itself indicates the presence of the term in the document. The most common payload, however, is term frequency (tf), or the number of times the term occurs in the document. More complex payloads include positions of every occurrence of the term in the document (to support phrase queries and document scoring based on term proximity), properties of the term (such as whether it occurred in the page title or not, to support document ranking based on notions of importance), or even the results of additional linguistic processing (for example, indicating that the term is part of a place name, to support address searches). In the web context, anchor text information (text associated with hyperlinks from other pages to the page in question) is useful in enriching the representation of document content (e.g., [107]).

In the example shown in Figure 4.1, we see that term1 occurs in {d1, d5, d6, d11, . . .}, term2 occurs in {d11, d23, d59, d84, . . .}, and term3 occurs in {d1, d4, d11, d19, . . .}. Generally, postings are sorted by document id, although other sort orders are possible as well. For this discussion, we assume that documents can be identified by a unique integer ranging from 1 to n, where n is the total number of documents. (It is preferable to start numbering the documents at one, since it is not possible to code zero with many common compression schemes used in information retrieval; see Section 4.5.) The document ids have no inherent semantic meaning, although the assignment of numeric ids to documents need not be arbitrary. For example, pages from the same domain may be consecutively numbered. Or, alternatively, pages that are higher in quality (based, for example, on PageRank values) might be assigned smaller numeric values so that they appear toward the front of a postings list.
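As a toy illustration of this data structure (not the MapReduce construction, which comes in Section 4.3), the following Python sketch builds an in-memory inverted index whose postings are (document id, term frequency) pairs sorted by document id; the three tiny "documents" are invented for the example:

    from collections import Counter, defaultdict

    docs = {1: "one fish two fish", 2: "red fish blue fish", 3: "one red bird"}

    index = defaultdict(list)                  # term -> postings list of (docid, tf)
    for docid in sorted(docs):                 # visiting docids in order keeps postings sorted
        for term, tf in Counter(docs[docid].split()).items():
            index[term].append((docid, tf))

    print(index["fish"])   # [(1, 2), (2, 2)]
    print(index["red"])    # [(2, 1), (3, 1)]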

Figure 4.1: Simple illustration of an inverted index. Each term is associated with a list of postings. Each posting is comprised of a document id and a payload, denoted by p in this case. An inverted index provides quick access to document ids that contain a term.

Query evaluation. Given a query, retrieval involves fetching postings lists associated with query terms and traversing the postings to compute the result set. In the simplest case, boolean retrieval involves set operations (union for boolean OR and intersection for boolean AND) on postings lists, which can be accomplished very efficiently since the postings are sorted by document id. In the general case, however, query–document scores must be computed. Partial document scores are stored in structures called accumulators. At the end (i.e., once all postings have been processed), the top k documents are then extracted to yield a ranked list of results for the user. Of course, there are many optimization strategies for query evaluation (both approximate and exact) that reduce the number of postings a retrieval engine must examine.

The size of an inverted index varies, depending on the payload stored in each posting. If only term frequency is stored, a well-optimized inverted index can be a tenth of the size of the original document collection. An inverted index that stores positional information would easily be several times larger than one that does not. Generally, it is possible to hold the entire vocabulary (i.e., dictionary of all the terms) in memory, especially with techniques such as front-coding [156]. However, with the exception of well-resourced, commercial web search engines (Google, for instance, keeps indexes in memory), postings lists are usually too large to store in memory and must be held on disk, usually in compressed form (more details in Section 4.5). Query evaluation therefore necessarily involves random disk access and "decoding" of the postings. One important aspect of the retrieval problem is to organize disk operations such that random seeks are minimized. Separately, an auxiliary data structure is necessary to maintain the mapping from integer document ids to some other more meaningful handle, such as a URL.

Once again, this brief discussion glosses over many complexities and does a huge injustice to the tremendous amount of research in information retrieval. Section 4.7 provides references to additional readings.
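To illustrate the kind of postings traversal described above, here is a small Python sketch of the intersection step for a boolean AND of two terms, exploiting the fact that both postings lists are sorted by document id; the postings themselves are made-up (docid, tf) pairs:

    def intersect_postings(p1, p2):
        # Linear merge of two docid-sorted postings lists for boolean AND.
        i, j, result = 0, 0, []
        while i < len(p1) and j < len(p2):
            d1, d2 = p1[i][0], p2[j][0]
            if d1 == d2:
                result.append(d1)          # document contains both terms
                i, j = i + 1, j + 1
            elif d1 < d2:
                i += 1                     # advance whichever list is behind
            else:
                j += 1
        return result

    fish = [(1, 2), (2, 2)]
    red  = [(2, 1), (3, 1)]
    print(intersect_postings(fish, red))   # [2]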

4.3 INVERTED INDEXING: BASELINE IMPLEMENTATION

MapReduce was designed from the very beginning to produce the various data structures involved in web search, including inverted indexes and the web graph. We begin with the basic inverted indexing algorithm shown in Figure 4.2:

    class Mapper
      procedure Map(docid n, doc d)
        H ← new AssociativeArray
        for all term t ∈ doc d do
          H{t} ← H{t} + 1
        for all term t ∈ H do
          Emit(term t, posting ⟨n, H{t}⟩)

    class Reducer
      procedure Reduce(term t, postings [⟨n1, f1⟩, ⟨n2, f2⟩, . . .])
        P ← new List
        for all posting ⟨a, f⟩ ∈ postings [⟨n1, f1⟩, ⟨n2, f2⟩, . . .] do
          Append(P, ⟨a, f⟩)
        Sort(P)
        Emit(term t, postings P)

Figure 4.2: Pseudo-code of the baseline inverted indexing algorithm in MapReduce. Mappers emit postings keyed by terms, the execution framework groups postings by term, and the reducers write postings lists to disk.

Input to the mapper consists of document ids (keys) paired with the actual content (values). Individual documents are processed in parallel by the mappers. First, each document is analyzed and broken down into its component terms. The processing pipeline differs depending on the application and type of document, but for web pages typically involves stripping out HTML tags and other elements such as JavaScript code, tokenizing, case folding, removing stopwords (common words such as 'the', 'a', 'of', etc.), and stemming (removing affixes from words so that 'dogs' becomes 'dog'). Once again, Section 4.7 provides references to additional readings on document processing. Once the document has been analyzed, term frequencies are computed by iterating over all the terms and keeping track of counts. Lines 4 and 5 in the mapper pseudo-code reflect this process of computing term frequencies, but the pseudo-code hides the details of document processing.

After this histogram has been built, the mapper then iterates over all terms. For each term, a pair consisting of the document id and the term frequency is created. Each pair, denoted by ⟨n, H{t}⟩ in the pseudo-code, represents an individual posting. The mapper then emits an intermediate key-value pair with the term as the key and the posting as the value, in line 7 of the mapper pseudo-code. Although as presented here only the term frequency is stored in the posting, this algorithm can be easily augmented to store additional information (e.g., term positions) in the payload.

In the shuffle and sort phase, the MapReduce runtime essentially performs a large, distributed group by of the postings by term. Without any additional effort by the programmer, the execution framework brings together all the postings that belong in the same postings list. This tremendously simplifies the task of the reducer, which simply needs to gather together all the postings and write them to disk. The reducer begins by initializing an empty list and then appends all postings associated with the same key (term) to the list. The postings are then sorted by document id, and the entire postings list is emitted as a value, with the term as the key. Typically, the postings list is first compressed, but we leave this aside for now (see Section 4.4 for more details). The final key-value pairs are written to disk and comprise the inverted index. Since each reducer writes its output in a separate file in the distributed file system, our final index will be split across r files, where r is the number of reducers. There is no need to further consolidate these files. Separately, we must also build an index to the postings lists themselves for the retrieval engine: this is typically in the form of mappings from term to (file, byte offset) pairs, so that given a term, the retrieval engine can fetch its postings list by opening the appropriate file and seeking to the correct byte offset position in that file.

Execution of the complete algorithm is illustrated in Figure 4.3 with a toy example consisting of three documents, three mappers, and two reducers. Intermediate key-value pairs (from the mappers) and the final key-value pairs comprising the inverted index (from the reducers) are shown in the boxes with dotted lines. Postings are shown as pairs of boxes, with the document id on the left and the term frequency on the right.

The MapReduce programming model provides a very concise expression of the inverted indexing algorithm. Its implementation is similarly concise: the basic algorithm can be implemented in as few as a couple dozen lines of code in Hadoop (with minimal document processing). Such an implementation can be completed as a week-long programming assignment in a course for advanced undergraduates or first-year graduate students [83, 93]. In a non-MapReduce indexer, a significant fraction of the code is devoted to grouping postings by term, given constraints imposed by memory and disk (e.g., memory capacity is limited, disk seeks are slow, etc.). In MapReduce, the programmer does not need to worry about any of these issues—most of the heavy lifting is performed by the execution framework.

Figure 4.3: Simple illustration of the baseline inverted indexing algorithm in MapReduce with three mappers and two reducers. The three documents are: doc 1 "one fish, two fish"; doc 2 "red fish, blue fish"; doc 3 "one red bird". Postings are shown as (docid, tf) pairs. The mappers emit, for example, fish → (d1, 2), one → (d1, 1), two → (d1, 1); after the shuffle and sort phase aggregates values by key, one reducer receives fish → (d1, 2), (d2, 2); one → (d1, 1), (d3, 1); two → (d1, 1), and the other receives bird → (d3, 1); blue → (d2, 1); red → (d2, 1), (d3, 1).

4.4 INVERTED INDEXING: REVISED IMPLEMENTATION

The inverted indexing algorithm presented in the previous section serves as a reasonable baseline. However, there is a significant scalability bottleneck: the algorithm assumes that there is sufficient memory to hold all postings associated with the same term. Since the basic MapReduce execution framework makes no guarantees about the ordering of values associated with the same key, the reducer first buffers all postings (line 5 of the reducer pseudo-code in Figure 4.2) and then performs an in-memory sort before writing the postings to disk—for efficient retrieval, postings need to be sorted by document id. As collections become larger, postings lists grow longer, and at some point in time, reducers will run out of memory. (In principle, this need not be an in-memory sort: it is entirely possible to implement a disk-based sort within the reducer; see the similar discussion in Section 3.4.) However, one way to overcome the scalability

bottleneck is to let the MapReduce runtime do the sorting for us. Instead of emitting key-value pairs of the following type:

    (term t, posting ⟨docid, f⟩)

we emit intermediate key-value pairs of the type instead:

    (tuple ⟨t, docid⟩, tf f)

In other words, the key is a tuple containing the term and the document id, while the value is the term frequency. This is exactly the value-to-key conversion design pattern introduced in Section 3.4. With this modification, the programming model ensures that the postings arrive in the correct order. This, combined with the fact that reducers can hold state across multiple keys, allows postings lists to be created with minimal memory usage. As a detail, remember that we must define a custom partitioner to ensure that all tuples with the same term are shuffled to the same reducer.

The revised MapReduce inverted indexing algorithm is shown in Figure 4.4. The mapper remains unchanged for the most part, other than differences in the intermediate key-value pairs. The Reduce method is called for each key (i.e., ⟨t, n⟩), and by design, there will only be one value associated with each key. For each key-value pair, a posting can be directly added to the postings list. Since the postings are guaranteed to arrive in sorted order by document id, they can be incrementally coded in compressed form—thus ensuring a small memory footprint. Finally, when all postings associated with the same term have been processed (i.e., t ≠ tprev), the entire postings list is emitted. The final postings list must be written out in the Close method. As with the baseline algorithm, payloads can be easily changed: by simply replacing the intermediate value f (term frequency) with whatever else is desired (e.g., term positional information).

There is one more detail we must address when building inverted indexes. Since almost all retrieval models take into account document length when computing query–document scores, this information must also be extracted. Although it is straightforward to express this computation as another MapReduce job, this task can actually be folded into the inverted indexing process. We can take advantage of the ability for a mapper to hold state across the processing of multiple documents in the following manner: an in-memory associative array is created to store document lengths, which is populated as each document is processed. When processing the terms in each document, the document length is known (and in general there is no worry about insufficient memory to hold these data). This approach is essentially a variant of the in-mapper combining pattern. When the mapper finishes processing input records, document lengths are written out to HDFS as "side data" (i.e., in the Close method). Document length data ends up in m different files, where m is the number of mappers; these files are then consolidated into a more compact representation.

Alternatively, document length information can be emitted in special key-value pairs by the mapper. One must then write a custom partitioner so that these special key-value pairs are shuffled to a single reducer, which will be responsible for writing out the length data separate from the postings lists.

    class Mapper
      method Map(docid n, doc d)
        H ← new AssociativeArray
        for all term t ∈ doc d do
          H{t} ← H{t} + 1
        for all term t ∈ H do
          Emit(tuple ⟨t, n⟩, tf H{t})

    class Reducer
      method Initialize
        tprev ← ∅
        P ← new PostingsList
      method Reduce(tuple ⟨t, n⟩, tf [f])
        if t ≠ tprev ∧ tprev ≠ ∅ then
          Emit(term tprev, postings P)
          P.Reset()
        P.Add(⟨n, f⟩)
        tprev ← t
      method Close
        Emit(term t, postings P)

Figure 4.4: Pseudo-code of a scalable inverted indexing algorithm in MapReduce. By applying the value-to-key conversion design pattern, the execution framework is exploited to sort postings so that they arrive sorted by document id in the reducer.

4.5 INDEX COMPRESSION

We return to the question of how postings are actually compressed and stored on disk. This chapter devotes a substantial amount of space to this topic because index compression is one of the main differences between a "toy" indexer and one that works on real-world collections; otherwise, MapReduce inverted indexing algorithms are pretty straightforward.

Let us consider the canonical case where each posting consists of a document id and the term frequency. A naive implementation might represent the first as a 32-bit integer and the second as a 16-bit integer. Thus, a postings list might be encoded as follows:

    [(5, 2), (7, 3), (12, 1), (49, 1), (51, 2), . . .]

where each posting is represented by a pair in parentheses. Note that all brackets, parentheses, and commas are only included to enhance readability; in reality the postings would be represented as a long stream of integers. This naive implementation would require six bytes per posting. Using this scheme, the entire inverted index would be about as large as the collection itself. Fortunately, we can do significantly better.

The first trick is to encode differences between document ids as opposed to the document ids themselves. Since the postings are sorted by document ids, the differences (called d-gaps) must be positive integers greater than zero. The above postings list, represented with d-gaps, would be:

    [(5, 2), (2, 3), (5, 1), (37, 1), (2, 2), . . .]

Of course, we must actually encode the first document id. We haven't lost any information, since the original document ids can be easily reconstructed from the d-gaps. However, it's not obvious that we've reduced the space requirements either, since the largest possible d-gap is one less than the number of documents in the collection. This is where the second trick comes in, which is to represent the d-gaps in a way such that it takes less space for smaller numbers. Similarly, we want to apply the same techniques to compress the term frequencies, since for the most part they are also small values. But to understand how this is done, we need to take a slight detour into compression techniques, particularly for coding integers.

Compression, in general, can be characterized as either lossless or lossy: it's fairly obvious that lossless compression is required in this context. To start, it is important to understand that all compression techniques represent a time–space tradeoff. That is, we reduce the amount of space on disk necessary to store data, but at the cost of extra processor cycles that must be spent coding and decoding data. Therefore, it is possible that compression reduces size but also slows processing. However, if the two factors are properly balanced (i.e., decoding speed can keep up with disk bandwidth), we can achieve the best of both worlds: smaller and faster.

4.5.1 BYTE-ALIGNED AND WORD-ALIGNED CODES

In most programming languages, an integer is encoded in four bytes and holds a value between 0 and 2^32 − 1, inclusive. We limit our discussion to unsigned integers, since d-gaps are always positive (and greater than zero). (Note that 2^32 − 1 is "only" 4,294,967,295, which is much less than even the most conservative estimate of the size of the web.)

4.5.1 BYTE-ALIGNED AND WORD-ALIGNED CODES

In most programming languages, an integer is encoded in four bytes and holds a value between 0 and 2^32 − 1, inclusive. We limit our discussion to unsigned integers, since d-gaps are always positive (and greater than zero). This means that 1 and 4,294,967,295 both occupy four bytes. Obviously, encoding d-gaps this way doesn't yield any reductions in size.

A simple approach to compression is to only use as many bytes as is necessary to represent the integer. This is known as variable-length integer coding (varInt for short) and accomplished by using the high order bit of every byte as the continuation bit, which is set to one in the last byte and zero elsewhere. As a result, we have 7 bits per byte for coding the value, which means that 0 ≤ n < 2^7 can be expressed with 1 byte, 2^7 ≤ n < 2^14 with 2 bytes, 2^14 ≤ n < 2^21 with 3 bytes, and 2^21 ≤ n < 2^28 with 4 bytes. This scheme can be extended to code arbitrarily-large integers (i.e., beyond 4 bytes). As a concrete example, the two numbers:

127, 128

would be coded as such:

1 1111111, 0 0000001 1 0000000

The above code contains two code words, the first consisting of 1 byte, and the second consisting of 2 bytes. Of course, the comma and the spaces are there only for readability. Variable-length integers are byte-aligned because the code words always fall along byte boundaries. As a result, there is never any ambiguity about where one code word ends and the next begins. However, the downside of varInt coding is that decoding involves lots of bit operations (masks, shifts). Furthermore, the continuation bit sometimes results in frequent branch mispredicts (depending on the actual distribution of d-gaps), which slows down processing.

A variant of the varInt scheme was described by Jeff Dean in a keynote talk at the WSDM 2009 conference.10 The insight is to code groups of four integers at a time. Each group begins with a prefix byte, divided into four 2-bit values that specify the byte length of each of the following integers. For example, the following prefix byte:

00,00,01,10

indicates that the following four integers are one byte, one byte, two bytes, and three bytes, respectively. Therefore, each group of four integers would consume anywhere between 5 and 17 bytes. A simple lookup table based on the prefix byte directs the decoder on how to process subsequent bytes to recover the coded integers. The advantage of this group varInt coding scheme is that values can be decoded with fewer branch mispredicts and bitwise operations. Experiments reported by Dean suggest that decoding integers with this scheme is more than twice as fast as the basic varInt scheme.

10 http://research.google.com/people/jeff/WSDM09-keynote.pdf
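The following Python sketch (ours; the function names are illustrative) implements the basic varInt scheme using the convention described above: seven data bits per byte, most significant group first, with the continuation bit set to one only in the last byte.

    def varint_encode(n):
        """Encode a non-negative integer with 7 data bits per byte; the
        high-order bit of each byte is a continuation bit, set to one
        only in the last byte (the convention used in the text)."""
        groups = []
        while True:
            groups.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        groups.reverse()            # most significant 7-bit group first
        groups[-1] |= 0x80          # flag the final byte
        return bytes(groups)

    def varint_decode(data, pos=0):
        """Decode one integer starting at byte offset pos.
        Returns (value, position of the next code word)."""
        value = 0
        while True:
            b = data[pos]
            pos += 1
            value = (value << 7) | (b & 0x7F)
            if b & 0x80:            # continuation bit set: last byte
                return value, pos

    # The example from the text: 127 -> 1 1111111, 128 -> 0 0000001 1 0000000
    assert varint_encode(127) == bytes([0b11111111])
    assert varint_encode(128) == bytes([0b00000001, 0b10000000])
    assert varint_decode(varint_encode(128))[0] == 128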

In most architectures, accessing entire machine words is more efficient than fetching all its bytes separately. Therefore, it makes sense to store postings in increments of 16-bit, 32-bit, or 64-bit machine words. Anh and Moffat [8] presented several word-aligned coding methods, one of which is called Simple-9, based on 32-bit words. In this coding scheme, four bits in each 32-bit word are reserved as a selector. The remaining 28 bits are used to code actual integer values. Now, there are a variety of ways these 28 bits can be divided to code one or more integers: 28 bits can be used to code one 28-bit integer, two 14-bit integers, three 9-bit integers (with one bit unused), etc., all the way up to twenty-eight 1-bit integers. In fact, there are nine different ways the 28 bits can be divided into equal parts (hence the name of the technique), some with leftover unused bits. This is stored in the selector bits. Therefore, decoding involves reading a 32-bit word, examining the selector to see how the remaining 28 bits are packed, and then appropriately decoding each integer. Coding works in the opposite way: the algorithm scans ahead to see how many integers can be squeezed into 28 bits, packs those integers, and sets the selector bits appropriately.

4.5.2 BIT-ALIGNED CODES

The advantage of byte-aligned and word-aligned codes is that they can be coded and decoded quickly. The downside, however, is that they must consume multiples of eight bits, even when fewer bits might suffice (the Simple-9 scheme gets around this by packing multiple integers into a 32-bit word, but even then, bits are often wasted). In bit-aligned codes, on the other hand, code words can occupy any number of bits, meaning that boundaries can fall anywhere. In practice, coding and decoding bit-aligned codes require processing bytes and appropriately shifting or masking bits (usually more involved than varInt and group varInt coding).

One additional challenge with bit-aligned codes is that we need a mechanism to delimit code words, i.e., tell where the last ends and the next begins, since there are no byte boundaries to guide us. To address this issue, most bit-aligned codes are so-called prefix codes (confusingly, they are also called prefix-free codes), in which no valid code word is a prefix of any other valid code word. For example, coding 0 ≤ x < 3 with {00, 01, 1} is a valid prefix code, such that a sequence of bits:

0001101001010100

can be unambiguously segmented into:

00 01 1 01 00 1 01 01 00

and decoded without any additional delimiters. On the other hand, {0, 01} is not a valid prefix code, since 0 is a prefix of 01, and so we can't tell if 01 is two code words or one.

One of the simplest prefix codes is the unary code. An integer x > 0 is coded as x − 1 one bits followed by a zero bit. Note that unary codes do not allow the representation of zero, which is fine since d-gaps and term frequencies should never be zero.

 x     unary          γ           Golomb b = 5    Golomb b = 10
 1     0              0           0:00            0:000
 2     10             10:0        0:01            0:001
 3     110            10:1        0:10            0:010
 4     1110           110:00      0:110           0:011
 5     11110          110:01      0:111           0:100
 6     111110         110:10      10:00           0:101
 7     1111110        110:11      10:01           0:1100
 8     11111110       1110:000    10:10           0:1101
 9     111111110      1110:001    10:110          0:1110
 10    1111111110     1110:010    10:111          0:1111

Figure 4.5: The first ten positive integers in unary, γ, and Golomb (b = 5, 10) codes.

With unary code we can code x in x bits, which although economical for small values, becomes inefficient for even moderately large values. Unary codes are rarely used by themselves, but form a component of other coding schemes. Unary codes of the first ten positive integers are shown in Figure 4.5.

Elias γ code is an efficient coding scheme that is widely used in practice. An integer x > 0 is broken into two components, 1 + ⌊log2 x⌋ (= n, the length), which is coded in unary code, and x − 2^⌊log2 x⌋ (= r, the remainder), which is in binary.11 The unary component n specifies the number of bits required to code x, and the binary component codes the remainder r in n − 1 bits.12 As an example, consider x = 10: 1 + ⌊log2 10⌋ = 4, which is 1110 in unary code. The binary component codes x − 2^3 = 2 in 4 − 1 = 3 bits, which is 010. Putting both together, we arrive at 1110:010. The extra colon is inserted only for readability; it's not part of the final code, of course. For reference, the γ codes of the first ten positive integers are shown in Figure 4.5.

Working in reverse, it is easy to unambiguously decode a bit stream of γ codes: First, we read a unary code cu, which is a prefix code. This tells us that the binary portion is written in cu − 1 bits, which we then read as cb. We can then reconstruct x as 2^(cu−1) + cb. For x < 16, γ codes occupy less than a full byte, which makes them more compact than variable-length integer codes. Since term frequencies for the most part are relatively small, γ codes make sense for them and can yield substantial space savings.

11 As a note, some sources describe slightly different formulations of the same coding scheme. Here, we adopt the conventions used in the classic IR text Managing Gigabytes [156].
12 Note that ⌊x⌋ is the floor function, which maps x to the largest integer not greater than x, so, e.g., ⌊3.8⌋ = 3. This is the default behavior in many programming languages when casting from a floating-point type to an integer type.
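As an illustration of the scheme just described, here is a minimal Python sketch of γ coding (ours; it operates over strings of bits for clarity, whereas a real implementation would pack bits into bytes).

    def gamma_encode(x):
        """Elias gamma code for x > 0, returned as a string of bits:
        the length n = 1 + floor(log2 x) in unary, followed by the
        remainder x - 2^(n-1) in n - 1 binary bits."""
        assert x > 0
        n = x.bit_length()                   # 1 + floor(log2 x)
        r = x - (1 << (n - 1))
        unary = "1" * (n - 1) + "0"
        binary = format(r, "b").zfill(n - 1) if n > 1 else ""
        return unary + binary

    def gamma_decode(bits, pos=0):
        """Decode one gamma code starting at bit offset pos.
        Returns (value, position of the next code word)."""
        ones = 0
        while bits[pos] == "1":              # read the unary prefix
            ones += 1
            pos += 1
        pos += 1                             # skip the terminating zero
        cb = int(bits[pos:pos + ones], 2) if ones > 0 else 0
        return (1 << ones) + cb, pos + ones

    # The worked example from the text: x = 10 is coded as 1110:010
    assert gamma_encode(10) == "1110010"
    assert gamma_decode("1110010")[0] == 10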


A variation on γ code is δ code, where the n portion of the γ code is coded in γ code itself (as opposed to unary code). For smaller values γ codes are more compact, but for larger values, δ codes take less space.

Unary and γ codes are parameterless, but even better compression can be achieved with parameterized codes. A good example of this is Golomb code. For some parameter b, an integer x > 0 is coded in two parts: first, we compute q = ⌊(x − 1)/b⌋ and code q + 1 in unary; then, we code the remainder r = x − qb − 1 in truncated binary. This is accomplished as follows: if b is a power of two, then truncated binary is exactly the same as normal binary, requiring ⌊log2 b⌋ bits. Otherwise, we code the first 2^(⌊log2 b⌋+1) − b values of r in ⌊log2 b⌋ bits and code the rest of the values of r by coding r + 2^(⌊log2 b⌋+1) − b in ordinary binary representation using ⌊log2 b⌋ + 1 bits. In this case, r is coded in either ⌊log2 b⌋ or ⌊log2 b⌋ + 1 bits, and unlike ordinary binary coding, truncated binary codes are prefix codes. As an example, if b = 5, then r can take the values {0, 1, 2, 3, 4}, which would be coded with the following code words: {00, 01, 10, 110, 111}. For reference, Golomb codes of the first ten positive integers are shown in Figure 4.5 for b = 5 and b = 10. A special case of Golomb code is worth noting: if b is a power of two, then coding and decoding can be handled more efficiently (needing only bit shifts and bit masks, as opposed to multiplication and division). These are known as Rice codes.

Researchers have shown that Golomb compression works well for d-gaps, and is optimal with the following parameter setting:

b ≈ 0.69 × N/df    (4.1)

where df is the document frequency of the term, and N is the number of documents in the collection.13

Putting everything together, one popular approach for postings compression is to represent d-gaps with Golomb codes and term frequencies with γ codes [156, 162]. If positional information is desired, we can use the same trick to code differences between term positions using γ codes.

13 For details as to why this is the case, we refer the reader elsewhere [156], but here's the intuition: under reasonable assumptions, the appearance of postings can be modeled as a sequence of independent Bernoulli trials, which implies a certain distribution of d-gaps. From this we can derive an optimal setting of b.
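The following Python sketch (ours) illustrates Golomb coding with truncated binary remainders, together with the parameter setting of Equation 4.1; it reproduces entries from the b = 5 column of Figure 4.5.

    def golomb_encode(x, b):
        """Golomb code for x > 0 with parameter b, as a string of bits:
        the quotient q = (x - 1) // b coded in unary (as q + 1), then
        the remainder r = x - q*b - 1 in truncated binary."""
        assert x > 0 and b > 0
        q, r = divmod(x - 1, b)
        bits = "1" * q + "0"                 # unary code for q + 1
        k = b.bit_length() - 1               # floor(log2 b)
        cutoff = (1 << (k + 1)) - b          # first `cutoff` values use k bits
        # When b is a power of two, cutoff == b, so this reduces to
        # ordinary k-bit binary (a Rice code).
        if r < cutoff:
            bits += format(r, "b").zfill(k)
        else:
            bits += format(r + cutoff, "b").zfill(k + 1)
        return bits

    def golomb_parameter(N, df):
        """Parameter setting from Equation 4.1: b ~ 0.69 * N / df."""
        return max(1, int(0.69 * N / df))

    # Matching Figure 4.5 for b = 5: 1 -> 0:00, 4 -> 0:110, 6 -> 10:00
    assert golomb_encode(1, 5) == "000"
    assert golomb_encode(4, 5) == "0110"
    assert golomb_encode(6, 5) == "1000"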

4.5.3 POSTINGS COMPRESSION

Having completed our slight detour into integer compression techniques, we can now return to the scalable inverted indexing algorithm shown in Figure 4.4 and discuss how postings lists can be properly compressed. As we can see from the previous section, there is a wide range of choices that represent different tradeoffs between compression ratio and decoding speed. Actual performance also depends on characteristics of the collection, which, among other factors, determine the distribution of d-gaps. Büttcher et al. [30] recently compared the performance of various compression techniques on coding document ids. In terms of the amount of compression that can be obtained (measured in bits per docid), Golomb and Rice codes performed the best, followed by γ codes, Simple-9, varInt, and group varInt (the least space efficient). In terms of raw decoding speed, the order was almost the reverse: group varInt was the fastest, followed by varInt.14 Simple-9 was substantially slower, and the bit-aligned codes were even slower than that. Within the bit-aligned codes, Rice codes were the fastest, followed by γ, with Golomb codes being the slowest (about ten times slower than group varInt).

Let us discuss what modifications are necessary to our inverted indexing algorithm if we were to adopt Golomb compression for d-gaps and represent term frequencies with γ codes. Note that this represents a space-efficient encoding, at the cost of slower decoding compared to alternatives. Whether or not this is actually a worthwhile tradeoff in practice is not important here: use of Golomb codes serves a pedagogical purpose, to illustrate how one might set compression parameters.

Coding term frequencies with γ codes is easy since they are parameterless. Compressing d-gaps with Golomb codes, however, is a bit tricky, since two parameters are required: the size of the document collection and the number of postings for a particular postings list (i.e., the document frequency, or df). The first is easy to obtain and can be passed into the reducer as a constant. The df of a term, however, is not known until all the postings have been processed—and unfortunately, the parameter must be known before any posting is coded. At first glance, this seems like a chicken-and-egg problem. A two-pass solution that involves first buffering the postings (in memory) would suffer from the memory bottleneck we've been trying to avoid in the first place.

To get around this problem, we need to somehow inform the reducer of a term's df before any of its postings arrive. This can be solved with the order inversion design pattern introduced in Section 3.3 to compute relative frequencies. The solution is to have the mapper emit special keys of the form ⟨t, ∗⟩ to communicate partial document frequencies. That is, inside the mapper, in addition to emitting intermediate key-value pairs of the following form:

(tuple ⟨t, docid⟩, tf f)

we also emit special intermediate key-value pairs like this:

(tuple ⟨t, ∗⟩, df e)

to keep track of document frequencies associated with each term. In practice, we can accomplish this by applying the in-mapper combining design pattern (see Section 3.1). The mapper holds an in-memory associative array that keeps track of how many documents a term has been observed in (i.e., the local document frequency of the term for the subset of documents processed by the mapper).

14 However, this study found less speed difference between group varInt and basic varInt than Dean's analysis, presumably due to the different distribution of d-gaps in the collections they were examining.


Once the mapper has processed all input records, special keys of the form ⟨t, ∗⟩ are emitted with the partial df as the value. To ensure that these special keys arrive first, we define the sort order of the tuple so that the special symbol ∗ precedes all documents (part of the order inversion design pattern). Thus, for each term, the reducer will first encounter the ⟨t, ∗⟩ key, associated with a list of values representing partial df values originating from each mapper. Summing all these partial contributions will yield the term's df, which can then be used to set the Golomb compression parameter b. This allows the postings to be incrementally compressed as they are encountered in the reducer—memory bottlenecks are eliminated since we do not need to buffer postings in memory. Once again, the order inversion design pattern comes to the rescue. Recall that the pattern is useful when a reducer needs to access the result of a computation (e.g., an aggregate statistic) before it encounters the data necessary to produce that computation. For computing relative frequencies, that bit of information was the marginal. In this case, it's the document frequency.
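The following Python fragment is a simplified simulation (ours, not Hadoop code) of the mapper-side bookkeeping and the custom sort order just described: each mapper emits ⟨t, ∗⟩ keys carrying partial document frequencies, and the sort order places them before all ⟨t, docid⟩ keys for the same term. The "*" sentinel and helper names are illustrative.

    from collections import defaultdict

    def map_documents(docs):
        """Simulate the mapper over a list of (docid, text) pairs: emit
        ((term, docid), tf) pairs plus one special ((term, '*'), df)
        pair per term, accumulated with in-mapper combining."""
        pairs, local_df = [], defaultdict(int)
        for docid, text in docs:
            tf = defaultdict(int)
            for term in text.split():
                tf[term] += 1
            for term, f in tf.items():
                pairs.append(((term, docid), f))
                local_df[term] += 1
        pairs.extend(((term, "*"), df) for term, df in local_df.items())
        return pairs

    def sort_key(pair):
        """Sort so that (t, '*') precedes every (t, docid) key,
        mimicking the order inversion design pattern."""
        (term, second), _ = pair
        return (term, 0, 0) if second == "*" else (term, 1, second)

    emitted = map_documents([(1, "a b b"), (2, "b c")])
    for (term, second), value in sorted(emitted, key=sort_key):
        print(term, second, value)   # for each term, '*' arrives first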

4.6 WHAT ABOUT RETRIEVAL?

Thus far, we have briefly discussed web crawling and focused mostly on MapReduce algorithms for inverted indexing. What about retrieval? It should be fairly obvious that MapReduce, which was designed for large batch operations, is a poor solution for retrieval. Since users demand sub-second response times, every aspect of retrieval must be optimized for low latency, which is exactly the opposite tradeoff made in MapReduce.

Recall the basic retrieval problem: we must look up postings lists corresponding to query terms, systematically traverse those postings lists to compute query–document scores, and then return the top k results to the user. Looking up postings implies random disk seeks, since for the most part postings are too large to fit into memory (leaving aside caching and other special cases for now). Unfortunately, random access is not a forte of the distributed file system underlying MapReduce—such operations require multiple round-trip network exchanges (and associated latencies). In HDFS, a client must first obtain the location of the desired data block from the namenode before the appropriate datanode can be contacted for the actual data. Of course, access will typically require a random disk seek on the datanode itself.

It should be fairly obvious that serving the search needs of a large number of users, each of whom demands sub-second response times, is beyond the capabilities of any single machine. The only solution is to distribute retrieval across a large number of machines, which necessitates breaking up the index in some manner. There are two main partitioning strategies for distributed retrieval: document partitioning and term partitioning. Under document partitioning, the entire collection is broken up into multiple smaller sub-collections, each of which is assigned to a server. In other words, each server holds the complete index for a subset of the entire collection.

Figure 4.6: Term–document matrix for a toy collection (nine documents, nine terms) illustrating different partitioning strategies: partitioning vertically (1, 2, 3) corresponds to document partitioning, whereas partitioning horizontally (a, b, c) corresponds to term partitioning.

This corresponds to partitioning vertically in Figure 4.6. With term partitioning, on the other hand, each server is responsible for a subset of the terms for the entire collection. That is, a server holds the postings for all documents in the collection for a subset of terms. This corresponds to partitioning horizontally in Figure 4.6.

Document and term partitioning require different retrieval strategies and represent different tradeoffs. Retrieval under document partitioning involves a query broker, which forwards the user's query to all partition servers, merges partial results from each, and then returns the final results to the user. With this architecture, searching the entire collection requires that the query be processed by every partition server. However, since each partition operates independently and traverses postings in parallel, document partitioning typically yields shorter query latencies (compared to a single monolithic index with much longer postings lists).

Retrieval under term partitioning, on the other hand, requires a very different strategy. Suppose the user's query Q contains three terms, q1, q2, and q3.

Under the pipelined query evaluation strategy, the broker begins by forwarding the query to the server that holds the postings for q1 (usually the least frequent term). The server traverses the appropriate postings list and computes partial query–document scores, stored in the accumulators. The accumulators are then passed to the server that holds the postings associated with q2 for additional processing, and then to the server for q3, before final results are passed back to the broker and returned to the user. Although this query evaluation strategy may not substantially reduce the latency of any particular query, it can theoretically increase a system's throughput due to the far smaller number of total disk seeks required for each user query (compared to document partitioning). However, load-balancing is tricky in a pipelined term-partitioned architecture due to skew in the distribution of query terms, which can create "hot spots" on servers that hold the postings for frequently-occurring query terms. In general, studies have shown that document partitioning is a better strategy overall [109], and this is the strategy adopted by Google [16]. Furthermore, it is known that Google maintains its indexes in memory (although this is certainly not the common case for search engines in general).

One key advantage of document partitioning is that result quality degrades gracefully with machine failures. Partition servers that are offline will simply fail to deliver results for their subsets of the collection. With sufficient partitions, users might not even be aware that documents are missing. For most queries, the web contains more relevant documents than any user has time to digest: users of course care about getting relevant documents (sometimes, they are happy with a single relevant document), but they are generally less discriminating when it comes to which relevant documents appear in their results (out of the set of all relevant documents). Note that partitions may be unavailable due to reasons other than machine failure: cycling through different partitions is a very simple and non-disruptive strategy for index updates.

Working in a document-partitioned architecture, there are a variety of approaches to dividing up the web into smaller pieces. Proper partitioning of the collection can address one major weakness of this architecture, which is that every partition server is involved in every user query. Along one dimension, it is desirable to partition by document quality using one or more classifiers; see [124] for a recent survey on web page classification. Partitioning by document quality supports a multi-phase search strategy: the system examines partitions containing high quality documents first, and only backs off to partitions containing lower quality documents if necessary. This reduces the number of servers that need to be contacted for a user query. Along an orthogonal dimension, it is desirable to partition documents by content (perhaps also guided by the distribution of user queries from logs), so that each partition is "well separated" from the others in terms of topical coverage. This also reduces the number of machines that need to be involved in serving a user's query: the broker can direct queries only to the partitions that are likely to contain relevant documents, as opposed to forwarding the user query to all the partitions.

Query evaluation can also benefit immensely from caching, of individual postings (assuming that the index is not already in memory) and even results of entire queries [13]. This is made possible by the Zipfian distribution of queries, with very frequent queries at the head of the distribution dominating the total number of queries. Search engines take advantage of this with cache servers, which are functionally distinct from all of the components discussed above.

There are two final components of real-world search engines that are worth discussing. First, recall that postings only store document ids. Therefore, raw retrieval results consist of a ranked list of semantically meaningless document ids. It is typically the responsibility of document servers, functionally distinct from the partition servers holding the indexes, to generate meaningful output for user presentation. Abstractly, a document server takes as input a query and a document id, and computes an appropriate result entry, typically comprising the title and URL of the page, a snippet of the source document showing the user's query terms in context, and additional metadata about the document. Second, on a large scale, reliability of service is provided by replication, both in terms of multiple machines serving the same partition within a single datacenter, but also replication across geographically-distributed datacenters. This creates at least two query routing problems: since it makes sense to serve clients from the closest datacenter, a service must route queries to the appropriate location. Within a single datacenter, the system needs to properly balance load across replicas.

4.7 SUMMARY AND ADDITIONAL READINGS

Web search is a complex problem that breaks down into three conceptually-distinct components. First, the documents collection must be gathered (by crawling the web). Second, inverted indexes and other auxiliary data structures must be built from the documents. Both of these can be considered offline problems. Finally, index structures must be accessed and processed in response to user queries to generate search results. This last task is an online problem that demands both low latency and high throughput.

This chapter primarily focused on building inverted indexes, the problem most suitable for MapReduce. After all, inverted indexing is nothing but a very large distributed sort and group by operation! We began with a baseline implementation of an inverted indexing algorithm, but quickly noticed a scalability bottleneck that stemmed from having to buffer postings in memory. Application of the value-to-key conversion design pattern (Section 3.4) addressed the issue by offloading the task of sorting postings by document id to the MapReduce execution framework. We also surveyed various techniques for integer compression, which yield postings lists that are both more compact and faster to process. As a specific example, one could use Golomb codes for compressing d-gaps and γ codes for term frequencies.

We showed how the order inversion design pattern introduced in Section 3.3 for computing relative frequencies can be used to properly set compression parameters.

Our brief discussion of web search glosses over many complexities and does a huge injustice to the tremendous amount of research in information retrieval. Here, we provide a few entry points into the literature.

Additional Readings. A survey article by Zobel and Moffat [162] is an excellent starting point on indexing and retrieval algorithms. Another by Baeza-Yates et al. [11] overviews many important issues in distributed retrieval. A keynote talk at the WSDM 2009 conference by Jeff Dean revealed a lot of information about the evolution of the Google search architecture.15 Finally, a number of general information retrieval textbooks have been recently published [101, 42, 30]. Of these three, the one by Büttcher et al. [30] is noteworthy in having detailed experimental evaluations that compare the performance (both effectiveness and efficiency) of a wide range of algorithms and techniques. While outdated in many other respects, the textbook Managing Gigabytes [156] remains an excellent source for index compression techniques. ACM SIGIR is an annual conference and the most prestigious venue for academic information retrieval research; proceedings from those events are perhaps the best starting point for those wishing to keep abreast of publicly-documented developments in the field.

15 http://research.google.com/people/jeff/WSDM09-keynote.pdf

CHAPTER 5

Graph Algorithms

Graphs are ubiquitous in modern society: examples encountered by almost everyone on a daily basis include the hyperlink structure of the web (simply known as the web graph), social networks (manifest in the flow of email, phone call patterns, connections on social networking sites, etc.), and transportation networks (roads, bus routes, flights, etc.). Our very own existence is dependent on an intricate metabolic and regulatory network, which can be characterized as a large, complex graph involving interactions between genes, proteins, and other cellular products.

This chapter focuses on graph algorithms in MapReduce. Although most of the content has nothing to do with text processing per se, documents frequently exist in the context of some underlying network, making graph analysis an important component of many text processing applications. Perhaps the best known example is PageRank, a measure of web page quality based on the structure of hyperlinks, which is used in ranking results for web search. As one of the first applications of MapReduce, PageRank exemplifies a large class of graph algorithms that can be concisely captured in the programming model. We will discuss PageRank in detail later this chapter.

In general, graphs can be characterized by nodes (or vertices) and links (or edges) that connect pairs of nodes.1 These connections can be directed or undirected. In some graphs, there may be an edge from a node to itself, resulting in a self loop; in others, such edges are disallowed. We assume that both nodes and links may be annotated with additional metadata: as a simple example, in a social network where nodes represent individuals, there might be demographic information (e.g., age, gender, location) attached to the nodes and type information attached to the links (e.g., indicating type of relationship such as "friend" or "spouse").

Mathematicians have always been fascinated with graphs, dating back to Euler's paper on the Seven Bridges of Königsberg in 1736. Over the past few centuries, graphs have been extensively studied, and today much is known about their properties. Far more than theoretical curiosities, theorems and algorithms on graphs can be applied to solve many real-world problems:

• Graph search and path planning. Search algorithms on graphs are invoked millions of times a day, whenever anyone searches for directions on the web. Similar algorithms are also involved in friend recommendations and expert-finding in social networks. Path planning problems involving everything from network packets to delivery trucks represent another large class of graph search problems.

1 Throughout this chapter, we use node interchangeably with vertex and similarly with link and edge.

• Graph clustering. Can a large graph be divided into components that are relatively disjoint (for example, as measured by inter-component links [59])? Among other applications, this task is useful for identifying communities in social networks (of interest to sociologists who wish to understand how human relationships form and evolve) and for partitioning large graphs (of interest to computer scientists who seek to better parallelize graph processing). See [158] for a survey.

• Minimum spanning trees. A minimum spanning tree for a graph G with weighted edges is a tree that contains all vertices of the graph and a subset of edges that minimizes the sum of edge weights. A real-world example of this problem is a telecommunications company that wishes to lay optical fiber to span a number of destinations at the lowest possible cost (where weights denote costs). This approach has also been applied to a wide variety of problems, including social networks and the migration of Polynesian islanders [64].

• Bipartite graph matching. A bipartite graph is one whose vertices can be divided into two disjoint sets. Matching problems on such graphs can be used to model job seekers looking for employment or singles looking for dates.

• Maximum flow. In a weighted directed graph with two special nodes called the source and the sink, the max flow problem involves computing the amount of "traffic" that can be sent from source to sink given various flow capacities defined by edge weights. Transportation companies (airlines, shipping, etc.) and network operators grapple with complex versions of these problems on a daily basis.

• Identifying "special" nodes. There are many ways to define what special means, including metrics based on node in-degree, average distance to other nodes, and relationship to cluster structure. These special nodes are important to investigators attempting to break up terrorist cells, epidemiologists modeling the spread of diseases, advertisers trying to promote products, and many others.

A common feature of these problems is the scale of the datasets on which the algorithms must operate: for example, the hyperlink structure of the web, which contains billions of pages, or social networks that contain hundreds of millions of individuals. Clearly, algorithms that run on a single machine and depend on the entire graph residing in memory are not scalable. We'd like to put MapReduce to work on these challenges.2

This chapter is organized as follows: we begin in Section 5.1 with an introduction to graph representations, and then explore two classic graph algorithms in MapReduce: parallel breadth-first search (Section 5.2) and PageRank (Section 5.3). Before concluding with a summary and pointing out additional readings, Section 5.4 discusses a number of general issues that affect graph processing with MapReduce.

2 As a side note, Google recently published a short description of a system called Pregel [98], based on Valiant's Bulk Synchronous Parallel model [148], for large-scale graph algorithms; a longer description is anticipated in a forthcoming paper [99].

5.1 GRAPH REPRESENTATIONS

One common way to represent graphs is with an adjacency matrix. A graph with n nodes can be represented as an n × n square matrix M, where a value in cell mij indicates an edge from node ni to node nj. In the case of graphs with weighted edges, the matrix cells contain edge weights; otherwise, each cell contains either a one (indicating an edge) or a zero (indicating none). With undirected graphs, only half the matrix is used (e.g., cells above the diagonal). For graphs that allow self loops (a directed edge from a node to itself), the diagonal might be populated; otherwise, the diagonal remains empty. Figure 5.1 provides an example of a simple directed graph (left) and its adjacency matrix representation (middle).

       n1  n2  n3  n4  n5        adjacency lists
  n1    0   1   0   1   0        n1: [n2, n4]
  n2    0   0   1   0   1        n2: [n3, n5]
  n3    0   0   0   1   0        n3: [n4]
  n4    0   0   0   0   1        n4: [n5]
  n5    1   1   1   0   0        n5: [n1, n2, n3]

Figure 5.1: A simple directed graph (left) represented as an adjacency matrix (middle) and with adjacency lists (right).

Although mathematicians prefer the adjacency matrix representation of graphs for easy manipulation with linear algebra, such a representation is far from ideal for computer scientists concerned with efficient algorithmic implementations. Most of the applications discussed in the chapter introduction involve sparse graphs, where the number of actual edges is far smaller than the number of possible edges.3 For example, in a social network of n individuals, there are n(n − 1) possible "friendships" (where n may be on the order of hundreds of millions). However, even the most gregarious will have relatively few friends compared to the size of the network (thousands, perhaps, but still far smaller than hundreds of millions). The same is true for the hyperlink structure of the web: each individual web page links to a minuscule portion of all the pages on the web.

3 Unfortunately, there is no precise definition of sparseness agreed upon by all, but one common definition is that a sparse graph has O(n) edges, where n is the number of vertices.

In this chapter, we assume processing of sparse graphs, although we will return to this issue in Section 5.4.

The major problem with an adjacency matrix representation for sparse graphs is its O(n^2) space requirement. Furthermore, most of the cells are zero, by definition. As a result, most computational implementations of graph algorithms operate over adjacency lists, in which a node is associated with neighbors that can be reached via outgoing edges. Figure 5.1 also shows the adjacency list representation of the graph under consideration (on the right). For example, since n1 is connected by directed edges to n2 and n4, those two nodes will be on the adjacency list of n1. There are two options for encoding undirected graphs: one could simply encode each edge twice (if ni and nj are connected, each appears on each other's adjacency list). Alternatively, one could order the nodes (arbitrarily or otherwise) and encode edges only on the adjacency list of the node that comes first in the ordering (i.e., if i < j, then nj is on the adjacency list of ni, but not the other way around).

Note that certain graph operations are easier on adjacency matrices than on adjacency lists. In the first, operations on incoming links for each node translate into a column scan on the matrix, whereas operations on outgoing links for each node translate into a row scan. With adjacency lists, it is natural to operate on outgoing links, but computing anything that requires knowledge of the incoming links of a node is difficult. However, as we shall see, the shuffle and sort mechanism in MapReduce provides an easy way to group edges by their destination nodes, thus allowing us to compute over incoming edges in the reducer. This property of the execution framework can also be used to invert the edges of a directed graph, by mapping over the nodes' adjacency lists and emitting key–value pairs with the destination node id as the key and the source node id as the value.4

4 This technique is used in anchor text inversion, where one gathers the anchor text of hyperlinks pointing to a particular page. It is common practice to enrich a web page's standard textual representation with all of the anchor text associated with its incoming hyperlinks (e.g., [107]).
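As a small illustration of this use of grouping by destination, the following Python sketch (ours) inverts the edges of the directed graph in Figure 5.1 by emitting (destination, source) pairs and collecting them by destination, which is exactly what the shuffle and sort phase provides for free in MapReduce.

    from collections import defaultdict

    def invert_edges(adjacency):
        """Invert a directed graph stored as adjacency lists: map over
        each node's outgoing edges and group the emitted
        (destination, source) pairs by destination."""
        inverted = defaultdict(list)
        for source, neighbors in adjacency.items():
            for destination in neighbors:     # "emit" (destination, source)
                inverted[destination].append(source)
        return dict(inverted)

    # The graph of Figure 5.1
    graph = {"n1": ["n2", "n4"], "n2": ["n3", "n5"], "n3": ["n4"],
             "n4": ["n5"], "n5": ["n1", "n2", "n3"]}
    print(invert_edges(graph))   # e.g., n4 is reachable from n1 and n3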

5.2 PARALLEL BREADTH-FIRST SEARCH

One of the most common and well-studied problems in graph theory is the single-source shortest path problem, where the task is to find shortest paths from a source node to all other nodes in the graph (or alternatively, edges can be associated with costs or weights, in which case the task is to compute lowest-cost or lowest-weight paths). Such problems are a staple in undergraduate algorithm courses, where students are taught the solution using Dijkstra's algorithm. However, this famous algorithm assumes sequential processing—how would we solve this problem in parallel, and more specifically, with MapReduce?

1:  Dijkstra(G, w, s)
2:    d[s] ← 0
3:    for all vertex v ∈ V do
4:      d[v] ← ∞
5:    Q ← {V}
6:    while Q ≠ ∅ do
7:      u ← ExtractMin(Q)
8:      for all vertex v ∈ u.AdjacencyList do
9:        if d[v] > d[u] + w(u, v) then
10:         d[v] ← d[u] + w(u, v)

Figure 5.2: Pseudo-code for Dijkstra's algorithm, which is based on maintaining a global priority queue of nodes with priorities equal to their distances from the source node.

As a refresher and also to serve as a point of comparison, Dijkstra's algorithm is shown in Figure 5.2, adapted from Cormen, Leiserson, and Rivest's classic algorithms textbook [41] (often simply known as CLR). The input to the algorithm is a directed, connected graph G = (V, E) represented with adjacency lists, w containing edge distances such that w(u, v) ≥ 0, and the source node s. The algorithm begins by first setting distances to all vertices d[v], v ∈ V to ∞, except for the source node, whose distance to itself is zero. The algorithm maintains Q, a global priority queue of vertices with priorities equal to their distance values d.

Dijkstra's algorithm operates by iteratively selecting the node with the lowest current distance from the priority queue (initially, this is the source node). At each iteration, the algorithm "expands" that node by traversing the adjacency list of the selected node to see if any of those nodes can be reached with a path of a shorter distance. The algorithm terminates when the priority queue Q is empty, or equivalently, when all nodes have been considered. Note that the algorithm as presented in Figure 5.2 only computes the shortest distances; the actual paths can be recovered by storing "backpointers" for every node indicating a fragment of the shortest path. A sample trace of the algorithm running on a simple graph is shown in Figure 5.3 (example also adapted from CLR).
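For comparison with the MapReduce approach developed below, here is a compact Python rendering of Dijkstra's algorithm (ours) using a binary heap as the priority queue; the example graph is intended to match the one in Figure 5.3.

    import heapq

    def dijkstra(adjacency, source):
        """Single-source shortest distances via Dijkstra's algorithm,
        using a binary heap in place of the abstract priority queue of
        Figure 5.2. adjacency maps node -> list of (neighbor, weight)."""
        dist = {node: float("inf") for node in adjacency}
        dist[source] = 0
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue                     # stale entry; node already settled
            for v, w in adjacency[u]:
                if dist[v] > d + w:
                    dist[v] = d + w
                    heapq.heappush(heap, (dist[v], v))
        return dist

    # Graph intended to match Figure 5.3, with n1 as the source
    graph = {
        "n1": [("n2", 10), ("n3", 5)],
        "n2": [("n3", 2), ("n4", 1)],
        "n3": [("n2", 3), ("n4", 9), ("n5", 2)],
        "n4": [("n5", 4)],
        "n5": [("n1", 7), ("n4", 6)],
    }
    print(dijkstra(graph, "n1"))
    # {'n1': 0, 'n2': 8, 'n3': 5, 'n4': 9, 'n5': 7}, as in Figure 5.3(f)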

Figure 5.3: Example of Dijkstra's algorithm applied to a simple graph with five nodes, with n1 as the source and edge distances as indicated. Parts (a)–(e) show the running of the algorithm at each iteration, with the current distance inside the node. Nodes with thicker borders are those being expanded; nodes that have already been expanded are shown in black.

We start out in (a) with n1 having a distance of zero (since it's the source) and all other nodes having a distance of ∞. In the first iteration (a), n1 is selected as the node to expand (indicated by the thicker border). After the expansion, we see in (b) that n2 and n3 can be reached at a distance of 10 and 5, respectively. Also, we see in (b) that n3 is the next node selected for expansion. Nodes we have already considered for expansion are shown in black. Expanding n3, we see in (c) that the distance to n2 has decreased because we've found a shorter path. The nodes that will be expanded next, in order, are n5, n2, and n4. The algorithm terminates with the end state shown in (f), where we've discovered the shortest distance to all nodes.

The key to Dijkstra's algorithm is the priority queue that maintains a globally-sorted list of nodes by current distance. This is not possible in MapReduce, as the programming model does not provide a mechanism for exchanging global data. Instead, we adopt a brute force approach known as parallel breadth-first search. First, as a simplification let us assume that all edges have unit distance (modeling, for example, hyperlinks on the web). This makes the algorithm easier to understand, but we'll relax this restriction later.

The intuition behind the algorithm is this: the distance of all nodes connected directly to the source node is one; the distance of all nodes directly connected to those is two; and so on. Imagine water rippling away from a rock dropped into a pond—that's a good image of how parallel breadth-first search works. However, what if there are multiple paths to the same node? Suppose we wish to compute the shortest distance to node n. The shortest path must go through one of the nodes in M that contains an outgoing edge to n: we need to examine all m ∈ M to find ms, the node with the shortest distance. The shortest distance to n is the distance to ms plus one.

Pseudo-code for the implementation of the parallel breadth-first search algorithm is provided in Figure 5.4. As with Dijkstra's algorithm, we assume a connected, directed graph represented as adjacency lists. Distance to each node is directly stored alongside the adjacency list of that node, and initialized to ∞ for all nodes except for the source node. In the pseudo-code, we use n to denote the node id (an integer) and N to denote the node's corresponding data structure (adjacency list and current distance). The algorithm works by mapping over all nodes and emitting a key-value pair for each neighbor on the node's adjacency list. The key contains the node id of the neighbor, and the value is the current distance to the node plus one. This says: if we can reach node n with a distance d, then we must be able to reach all the nodes that are connected to n with distance d + 1. After shuffle and sort, reducers will receive keys corresponding to the destination node ids and distances corresponding to all paths leading to that node. The reducer will select the shortest of these distances and then update the distance in the node data structure.

It is apparent that parallel breadth-first search is an iterative algorithm, where each iteration corresponds to a MapReduce job. The first time we run the algorithm, we "discover" all nodes that are connected to the source. The second iteration, we discover all nodes connected to those, and so on. Each iteration of the algorithm expands the "search frontier" by one hop, and, eventually, all nodes will be discovered with their shortest distances (assuming a fully-connected graph). Before we discuss termination of the algorithm, there is one more detail required to make the parallel breadth-first search algorithm work. We need to "pass along" the graph structure from one iteration to the next. This is accomplished by emitting the node data structure itself, with the node id as a key (Figure 5.4, line 4 in the mapper). In the reducer, we must distinguish the node data structure from distance values (Figure 5.4, lines 5–6 in the reducer), and update the minimum distance in the node data structure before emitting it as the final value. The final output is now ready to serve as input to the next iteration.5

So how many iterations are necessary to compute the shortest distance to all nodes? The answer is the diameter of the graph, or the greatest distance between any pair of nodes. This number is surprisingly small for many real-world problems: the saying "six degrees of separation" suggests that everyone on the planet is connected to everyone else by at most six steps (the people a person knows are one step away, people that they know are two steps away, etc.). If this is indeed true, then parallel breadth-first search on the global social network would take at most six MapReduce iterations.

5 Note that in this algorithm we are overloading the value type, which can either be a distance (integer) or a complex data structure representing a node. The best way to achieve this in Hadoop is to create a wrapper object with an indicator variable specifying what the content is.

For more serious academic studies of "small world" phenomena in networks, we refer the reader to a number of publications [61, 62, 152, 2].

1:  class Mapper
2:    method Map(nid n, node N)
3:      d ← N.Distance
4:      Emit(nid n, N)                          ▷ Pass along graph structure
5:      for all nodeid m ∈ N.AdjacencyList do
6:        Emit(nid m, d + 1)                    ▷ Emit distances to reachable nodes

1:  class Reducer
2:    method Reduce(nid m, [d1, d2, . . .])
3:      dmin ← ∞
4:      M ← ∅
5:      for all d ∈ [d1, d2, . . .] do
6:        if IsNode(d) then
7:          M ← d                               ▷ Recover graph structure
8:        else if d < dmin then                 ▷ Look for shorter distance
9:          dmin ← d
10:     M.Distance ← dmin                       ▷ Update shortest distance
11:     Emit(nid m, node M)

Figure 5.4: Pseudo-code for parallel breadth-first search in MapReduce: the mappers emit distances to reachable nodes, while the reducers select the minimum of those distances for each destination node. Each iteration (one MapReduce job) of the algorithm expands the "search frontier" by one hop.

In practical terms, we iterate the algorithm until there are no more node distances that are ∞. Since the graph is connected, all nodes are reachable, and since all edge distances are one, all discovered nodes are guaranteed to have the shortest distances (i.e., there is not a shorter path that goes through a node that hasn't been discovered).

The actual checking of the termination condition must occur outside of MapReduce. Typically, execution of an iterative MapReduce algorithm requires a non-MapReduce "driver" program, which submits a MapReduce job to iterate the algorithm, checks to see if a termination condition has been met, and if not, repeats. Hadoop provides a lightweight API for constructs called "counters", which, as the name suggests, can be used for counting events that occur during execution, e.g., number of corrupt records, number of times a certain condition is met, or anything that the programmer desires. Counters can be defined to count the number of nodes that have distances of ∞: at the end of the job, the driver program can access the final counter value and check to see if another iteration is necessary.
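The following Python sketch (ours) simulates one iteration of the algorithm in Figure 5.4 for unit edge distances, including the grouping by destination node that the shuffle and sort phase performs; the toy graph and the dictionary-based node structure are illustrative, and every node id is assumed to appear as a key in the graph.

    from collections import defaultdict

    INF = float("inf")

    def bfs_map(nid, node):
        """Mapper: pass along the node structure, then emit a tentative
        distance of d + 1 to every neighbor on the adjacency list."""
        yield nid, node
        if node["distance"] != INF:
            for m in node["adjacency"]:
                yield m, node["distance"] + 1

    def bfs_reduce(nid, values):
        """Reducer: recover the node structure and keep the minimum of
        all tentative distances received for this node."""
        node, dmin = None, INF
        for v in values:
            if isinstance(v, dict):
                node = v
            elif v < dmin:
                dmin = v
        node["distance"] = min(node["distance"], dmin)
        return nid, node

    def bfs_iteration(graph):
        """Simulate one MapReduce iteration over the whole graph."""
        grouped = defaultdict(list)
        for nid, node in graph.items():
            for key, value in bfs_map(nid, node):
                grouped[key].append(value)            # shuffle and sort
        return dict(bfs_reduce(nid, vals) for nid, vals in grouped.items())

    # Hypothetical toy graph with source n1; iterate until nothing changes
    graph = {"n1": {"distance": 0,   "adjacency": ["n2", "n4"]},
             "n2": {"distance": INF, "adjacency": ["n3"]},
             "n3": {"distance": INF, "adjacency": []},
             "n4": {"distance": INF, "adjacency": ["n3"]}}
    for _ in range(3):
        graph = bfs_iteration(graph)
    print({n: v["distance"] for n, v in graph.items()})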

Figure 5.5: In the single source shortest path problem with arbitrary edge distances, the shortest path from source s to node r may go outside the current search frontier, in which case we will not find the shortest distance to r until the search frontier expands to cover q.

As with Dijkstra's algorithm in the form presented earlier, the parallel breadth-first search algorithm only finds the shortest distances, not the actual shortest paths. Storing "backpointers" at each node, as with Dijkstra's algorithm, will work, but may not be efficient since the graph needs to be traversed again to reconstruct the path segments. A simpler approach is to emit paths along with distances in the mapper, so that each node will have its shortest path easily accessible at all times. The additional space requirements for shuffling these data from mappers to reducers are relatively modest, since for the most part paths (i.e., sequences of node ids) are relatively short.

Up until now, we have been assuming that all edges are unit distance. Let us relax that restriction and see what changes are required in the parallel breadth-first search algorithm. The adjacency lists, which were previously lists of node ids, must now encode the edge distances as well. In line 6 of the mapper code in Figure 5.4, instead of emitting d + 1 as the value, we must now emit d + w where w is the edge distance. No other changes to the algorithm are required, but the termination behavior is very different. To illustrate, consider the graph fragment in Figure 5.5, where s is the source node, and assume for the sake of argument that we've already discovered the shortest distance to node p, and that the shortest distance to r so far goes through p. This, however, does not guarantee that we've discovered the shortest distance to r, since there may exist a path going through q that we haven't encountered yet (because it lies outside the search frontier).6 However, as the search frontier expands, we'll eventually cover q and all other nodes along the path from p to q to r—which means that with sufficient iterations, we will discover the shortest distance to r. But how do we know that we've found the shortest distance to p? Well, if the shortest path to p lies within the search frontier, we would have already discovered it.

6 Note that the same argument does not apply to the unit edge distance case: the shortest path cannot lie outside the search frontier since any such path would necessarily be longer.

And if it doesn't, we can repeat the same argument for all nodes on the path from s to p. The conclusion is that, with sufficient iterations, we'll eventually discover all the shortest distances.

So exactly how many iterations does "eventually" mean? In the worst case, we might need as many iterations as there are nodes in the graph minus one. In fact, it is not difficult to construct graphs that will elicit this worst-case behavior: Figure 5.6 provides an example, with n1 as the source. The parallel breadth-first search algorithm would not discover that the shortest path from n1 to n6 goes through n3, n4, and n5 until the fifth iteration. Three more iterations are necessary to cover the rest of the graph. Fortunately, for most real-world graphs, such extreme cases are rare, and the number of iterations necessary to discover all shortest distances is quite close to the diameter of the graph.

Figure 5.6: A sample graph that elicits worst-case behavior for parallel breadth-first search. Eight iterations are required to discover shortest distances to all nodes from n1.

In practical terms, how do we know when to stop iterating in the case of arbitrary edge distances? The algorithm can terminate when shortest distances at every node no longer change. Once again, we can use counters to keep track of such events. Every time we encounter a shorter distance in the reducer, we increment a counter. At the end of each MapReduce iteration, the driver program reads the counter value and determines if another iteration is necessary.

Compared to Dijkstra's algorithm on a single processor, parallel breadth-first search in MapReduce can be characterized as a brute force approach that "wastes" a lot of time performing computations whose results are discarded. At each iteration, the algorithm attempts to recompute distances to all nodes, but in reality only useful work is done along the search frontier: inside the search frontier, the algorithm is simply repeating previous computations.7 Outside the search frontier, the algorithm hasn't discovered any paths to nodes there yet, so no meaningful work is done.

7 Unless the algorithm discovers an instance of the situation described in Figure 5.5, in which case, updated distances will propagate inside the search frontier.

Dijkstra's algorithm, on the other hand, is far more efficient. Every time a node is explored, we're guaranteed to have already found the shortest path to it. However, this is made possible by maintaining a global data structure (a priority queue) that holds nodes sorted by distance—this is not possible in MapReduce because the programming model does not provide support for global data that is mutable and accessible by the mappers and reducers. These inefficiencies represent the cost of parallelization.

The parallel breadth-first search algorithm is instructive in that it represents the prototypical structure of a large class of graph algorithms in MapReduce. They share in the following characteristics:

• The graph structure is represented with adjacency lists, which are part of some larger node data structure that may contain additional information (variables to store intermediate output, features of the nodes). In many cases, features are attached to edges as well (e.g., edge weights).

• The MapReduce algorithm maps over the node data structures and performs a computation that is a function of features of the node, intermediate output attached to each node, and features of the adjacency list (outgoing edges and their features). In other words, computations can only involve a node's internal state and its local graph structure. The results of these computations are emitted as values, keyed with the node ids of the neighbors (i.e., those nodes on the adjacency lists). Conceptually, we can think of this as "passing" the results of the computation along outgoing edges. In the reducer, the algorithm receives all partial results that have the same destination node, and performs another computation (usually, some form of aggregation).

• In addition to computations, the graph itself is also passed from the mapper to the reducer. In the reducer, the data structure corresponding to each node is updated and written back to disk.

• Graph algorithms in MapReduce are generally iterative, where the output of the previous iteration serves as input to the next iteration. The process is controlled by a non-MapReduce driver program that checks for termination.

For parallel breadth-first search, the mapper computation is the current distance plus edge distance (emitting distances to neighbors), while the reducer computation is the Min function (selecting the shortest path). As we will see in the next section, the MapReduce algorithm for PageRank works in much the same way.

5.3 PAGERANK

PageRank [117] is a measure of web page quality based on the structure of the hyperlink graph. Although it is only one of thousands of features that is taken into account in Google's search algorithm, it is perhaps one of the best known and most studied. A vivid way to illustrate PageRank is to imagine a random web surfer: the surfer visits a page, randomly clicks a link on that page, and repeats ad infinitum. PageRank is a measure of how frequently a page would be encountered by our tireless web surfer. More precisely, PageRank is a probability distribution over nodes in the graph representing the likelihood that a random walk over the link structure will arrive at a particular node. Nodes that have high in-degrees tend to have high PageRank values, as well as nodes that are linked to by other nodes with high PageRank values. This behavior makes intuitive sense: if PageRank is a measure of page quality, we would expect high-quality pages to contain "endorsements" from many other pages in the form of hyperlinks. Similarly, if a high-quality page links to another page, then the second page is likely to be high quality also. PageRank represents one particular approach to inferring the quality of a web page based on hyperlink structure; two other popular algorithms, not covered here, are SALSA [88] and HITS [84] (also known as "hubs and authorities").

The complete formulation of PageRank includes an additional component. As it turns out, our web surfer doesn't just randomly click links. Before the surfer decides where to go next, a biased coin is flipped—heads, the surfer clicks on a random link on the page as usual; tails, the surfer ignores the links on the page and randomly "jumps" or "teleports" to a completely different page. But enough about random web surfing. Formally, the PageRank P of a page n is defined as follows:

P(n) = α (1/|G|) + (1 − α) Σ_{m ∈ L(n)} P(m)/C(m)    (5.1)

where |G| is the total number of nodes (pages) in the graph, α is the random jump factor, L(n) is the set of pages that link to n, and C(m) is the out-degree of node m (the number of links on page m). The random jump factor α is sometimes called the "teleportation" factor; alternatively, (1 − α) is referred to as the "damping" factor.

Let us break down each component of the formula in detail. First, note that PageRank is defined recursively—this gives rise to an iterative algorithm we will detail in a bit. A web page n receives PageRank "contributions" from all pages that link to it, L(n). Let us consider a page m from the set of pages L(n): a random surfer at m will arrive at n with probability 1/C(m), since a link is selected at random from all outgoing links. Since the PageRank value of m is the probability that the random surfer will be at m, the probability of arriving at n from m is P(m)/C(m).

To compute the PageRank of n, we need to sum contributions from all pages that link to n; this is the summation in the second half of the equation. However, we also need to take into account the random jump: there is a 1/|G| chance of landing at any particular page, where |G| is the number of nodes in the graph. Of course, the two contributions need to be combined: with probability α the random surfer executes a random jump, and with probability 1 − α the random surfer follows a hyperlink.

Note that PageRank assumes a community of honest users who are not trying to "game" the measure. This is, of course, not true in the real world, where an adversarial relationship exists between search engine companies and a host of other organizations and individuals (marketers, spammers, activists, etc.) who are trying to manipulate search results: to promote a cause, product, or service, or in some cases, to trap and intentionally deceive users (see, for example, [12, 63]). A simple example is a so-called "spider trap", an infinite chain of pages (e.g., generated by CGI) that all link to a single page (thereby artificially inflating its PageRank). For this reason, PageRank is only one of thousands of features used in ranking web pages.

The fact that PageRank is recursively defined translates into an iterative algorithm which is quite similar in basic structure to parallel breadth-first search. We start by presenting an informal sketch. At the beginning of each iteration, a node passes its PageRank contributions to the other nodes that it is connected to. Since PageRank is a probability distribution, we can think of this as spreading probability mass to neighbors via outgoing links. To conclude the iteration, each node sums up all PageRank contributions that have been passed to it and computes an updated PageRank score. We can think of this as gathering probability mass passed to a node via its incoming links. This algorithm iterates until the PageRank values no longer change.

Figure 5.7 shows a toy example that illustrates two iterations of the algorithm. As a simplification, we ignore the random jump factor for now (i.e., α = 0) and further assume that there are no dangling nodes (i.e., nodes with no outgoing edges). The algorithm begins by initializing a uniform distribution of PageRank values across nodes. In the beginning of the first iteration (top, left), partial PageRank contributions are sent from each node to its neighbors connected via outgoing links. For example, n1 sends 0.1 PageRank mass to n2 and 0.1 PageRank mass to n4. This makes sense in terms of the random surfer model: if the surfer is at n1 with a probability of 0.2, then the surfer could end up either in n2 or n4 with a probability of 0.1 each. The same occurs for all the other nodes in the graph: note that n5 must split its PageRank mass three ways, since it has three neighbors, and n4 receives all the mass belonging to n3 because n3 isn't connected to any other node. The end of the first iteration is shown in the top right: each node sums up PageRank contributions from its neighbors. Note that since n1 has only one incoming link, its updated PageRank value is smaller than before, i.e., it "passed along" more PageRank mass than it received.
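To make the informal sketch concrete, the following Python snippet (a single-machine sketch, not the MapReduce implementation) runs the spread-and-gather step on the toy graph of Figures 5.7 and 5.9, whose adjacency lists are n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]; like the figure, it ignores the random jump factor and dangling nodes.

from collections import defaultdict

# Adjacency lists of the toy graph in Figures 5.7 and 5.9.
graph = {'n1': ['n2', 'n4'],
         'n2': ['n3', 'n5'],
         'n3': ['n4'],
         'n4': ['n5'],
         'n5': ['n1', 'n2', 'n3']}

def iterate(pagerank):
    """One iteration: spread mass along outgoing links, then gather at each node."""
    updated = defaultdict(float)
    for node, neighbors in graph.items():
        share = pagerank[node] / len(neighbors)   # divide the node's mass evenly
        for m in neighbors:
            updated[m] += share                   # gather mass via incoming links
    return dict(updated)

pr = {n: 1.0 / len(graph) for n in graph}         # uniform initialization (0.2 each)
for i in range(2):
    pr = iterate(pr)
    print(i + 1, {n: round(pr[n], 3) for n in sorted(pr)})

After the first iteration this reproduces the values shown in Figure 5.7 (approximately 0.066, 0.166, 0.166, 0.3, and 0.3 for n1 through n5), and the values continue to sum to one.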

The exact same process repeats, and the second iteration in our toy example is illustrated by the bottom two graphs. At the beginning of each iteration, the PageRank values of all nodes sum to one. PageRank mass is preserved by the algorithm, guaranteeing that we continue to have a valid probability distribution at the end of each iteration.

Figure 5.7: PageRank toy example showing two iterations. Left graphs show PageRank values at the beginning of each iteration and how much PageRank mass is passed to each neighbor. Right graphs show updated PageRank values at the end of each iteration.

Pseudo-code of the MapReduce PageRank algorithm is shown in Figure 5.8; it is simplified in that we continue to ignore the random jump factor and assume no dangling nodes (complications that we will return to later). An illustration of the running algorithm is shown in Figure 5.9 for the first iteration of the toy graph in Figure 5.7. The algorithm maps over the nodes, and for each node computes how much PageRank mass needs to be distributed to its neighbors (i.e., nodes on the adjacency list). Each piece of the PageRank mass is emitted as the value, keyed by the node ids of the neighbors. Conceptually, we can think of this as passing PageRank mass along outgoing edges. In the shuffle and sort phase, the MapReduce execution framework groups values (pieces of PageRank mass) passed along the graph edges by destination node (i.e., all edges that point to the same node). In the reducer, PageRank mass contributions from all incoming edges are summed to arrive at the updated PageRank value for each node.

class Mapper
  method Map(nid n, node N)
    p ← N.PageRank / |N.AdjacencyList|
    Emit(nid n, N)                           ▷ Pass along graph structure
    for all nodeid m ∈ N.AdjacencyList do
      Emit(nid m, p)                         ▷ Pass PageRank mass to neighbors

class Reducer
  method Reduce(nid m, [p1, p2, ...])
    M ← ∅
    s ← 0
    for all p ∈ [p1, p2, ...] do
      if IsNode(p) then
        M ← p                                ▷ Recover graph structure
      else
        s ← s + p                            ▷ Sum incoming PageRank contributions
    M.PageRank ← s
    Emit(nid m, node M)

Figure 5.8: Pseudo-code for PageRank in MapReduce (leaving aside dangling nodes and the random jump factor). In the map phase we evenly divide up each node's PageRank mass and pass each piece along outgoing edges to neighbors. In the reduce phase PageRank contributions are summed up at each destination node. Each MapReduce job corresponds to one iteration of the algorithm.
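To make the data flow of Figure 5.8 concrete, here is a minimal single-process simulation in Python. It is only a sketch of the same logic, not Hadoop code: the mapper emits the node structure under its own id and a share of PageRank mass under each neighbor's id, a sort plus groupby stands in for the shuffle and sort phase, and the reducer recovers the node structure and sums the incoming mass.

from itertools import groupby
from operator import itemgetter

def mapper(nid, node):
    yield nid, node                                   # pass along graph structure
    share = node['pagerank'] / len(node['adjacency'])
    for m in node['adjacency']:
        yield m, share                                # pass PageRank mass to neighbors

def reducer(nid, values):
    M, s = None, 0.0
    for p in values:
        if isinstance(p, dict):                       # plays the role of IsNode(p)
            M = p                                     # recover graph structure
        else:
            s += p                                    # sum incoming PageRank contributions
    M['pagerank'] = s
    return nid, M

def run_iteration(nodes):
    intermediate = [kv for nid, node in nodes.items() for kv in mapper(nid, node)]
    intermediate.sort(key=itemgetter(0))              # simulate the shuffle and sort phase
    return dict(reducer(k, [v for _, v in group])
                for k, group in groupby(intermediate, key=itemgetter(0)))

nodes = {n: {'pagerank': 0.2, 'adjacency': adj}       # toy graph, uniform initial mass
         for n, adj in [('n1', ['n2', 'n4']), ('n2', ['n3', 'n5']),
                        ('n3', ['n4']), ('n4', ['n5']), ('n5', ['n1', 'n2', 'n3'])]}
nodes = run_iteration(nodes)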

Figure 5.9: Illustration of the MapReduce PageRank algorithm corresponding to the first iteration in Figure 5.7. The size of each box is proportional to its PageRank value. During the map phase, PageRank mass is distributed evenly to nodes on each node's adjacency list (shown at the very top). Intermediate values are keyed by node (shown inside the boxes). In the reduce phase, all partial PageRank contributions are summed together to arrive at updated values. The adjacency lists of the toy graph are: n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3].

As with the parallel breadth-first search algorithm, the graph structure itself must be passed from iteration to iteration. Each node data structure is emitted in the mapper and written back out to disk in the reducer. All PageRank mass emitted by the mappers is accounted for in the reducer: since we begin with the sum of PageRank values across all nodes equal to one, the sum of all the updated PageRank values should remain a valid probability distribution.

Having discussed the simplified PageRank algorithm in MapReduce, let us now take into account the random jump factor and dangling nodes: as it turns out, both are treated similarly. Dangling nodes are nodes in the graph that have no outgoing edges, i.e., their adjacency lists are empty. In the hyperlink graph of the web, these might correspond to pages in a crawl that have not been downloaded yet. If we simply run the algorithm in Figure 5.8 on graphs with dangling nodes, the total PageRank mass will not be conserved, since no key-value pairs will be emitted when a dangling node is encountered in the mappers.

The proper treatment of PageRank mass "lost" at the dangling nodes is to redistribute it across all nodes in the graph evenly (cf. [22]). There are many ways to determine the missing PageRank mass. One simple approach is to instrument the algorithm in Figure 5.8 with counters: whenever the mapper processes a node with an empty adjacency list, it keeps track of the node's PageRank value in the counter. At the end of the iteration, we can access the counter to find out how much PageRank mass was lost at the dangling nodes.

Another approach is to reserve a special key for storing PageRank mass from dangling nodes. When the mapper encounters a dangling node, its PageRank mass is emitted with the special key; the reducer must be modified to contain special logic for handling the missing PageRank mass. Yet another approach is to write out the missing PageRank mass as "side data" for each map task (using the in-mapper combining technique for aggregation); a final pass in the driver program is needed to sum the mass across all map tasks. Either way, we arrive at the amount of PageRank mass lost at the dangling nodes. This then must be redistributed evenly across all nodes.

This redistribution process can be accomplished by mapping over all nodes again. At the same time, we can take into account the random jump factor. For each node, its current PageRank value p is updated to the final PageRank value p' according to the following formula:

    p' = \alpha \frac{1}{|G|} + (1 - \alpha) \left( \frac{m}{|G|} + p \right)    (5.2)

where m is the missing PageRank mass and |G| is the number of nodes in the entire graph. We add the PageRank mass from link traversal (p, computed from before) to the share of the lost PageRank mass that is distributed to each node (m/|G|). Finally, we take into account the random jump factor: with probability α the random surfer arrives via jumping, and with probability 1 − α the random surfer arrives via incoming links. Note that this MapReduce job requires no reducers.

Putting everything together, one iteration of PageRank requires two MapReduce jobs: the first to distribute PageRank mass along graph edges, and the second to take care of dangling nodes and the random jump factor. At the end of each iteration, we end up with exactly the same data structure as at the beginning, which is a requirement for the iterative algorithm to work. Also, the PageRank values of all nodes sum up to one, which ensures a valid probability distribution.

Typically, PageRank is iterated until convergence, i.e., when the PageRank values of nodes no longer change (within some tolerance, to take into account, for example, floating point precision errors). Therefore, at the end of each iteration, the PageRank driver program must check to see if convergence has been reached. Alternative stopping criteria include running a fixed number of iterations (useful if one wishes to bound algorithm running time) or stopping when the ranks of the PageRank values no longer change. The latter is useful for some applications that only care about comparing the PageRank of two arbitrary pages and do not need the actual PageRank values: rank stability is obtained faster than the actual convergence of values.

8 In Hadoop, counters are 8-byte integers: a simple workaround is to multiply PageRank values by a large constant, and then cast as an integer.
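A sketch of the second, map-only pass that applies Equation 5.2, assuming the missing mass m has already been recovered (from counters, the special key, or side data); the value of alpha below is an illustrative choice, not one prescribed by the text.

ALPHA = 0.15   # illustrative random jump factor (an assumed value)

def second_pass_map(nid, node, missing_mass, num_nodes):
    """Rewrite one node's PageRank according to Equation 5.2; no reducer is needed."""
    p = node['pagerank']
    node['pagerank'] = ALPHA * (1.0 / num_nodes) + \
                       (1.0 - ALPHA) * (missing_mass / num_nodes + p)
    return nid, node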

In absolute terms, how many iterations are necessary for PageRank to converge? This is a difficult question to answer precisely since it depends on many factors, but generally, the answer is not very meaningful due to the adversarial nature of web search as previously discussed: the web is full of spam and populated with sites that are actively trying to "game" PageRank and related hyperlink-based metrics. On today's web, running PageRank in its unmodified form presented here would yield unexpected and undesirable results. Of course, strategies developed by web search companies to combat link spam are proprietary (and closely-guarded secrets, for obvious reasons), but undoubtedly these algorithmic modifications impact convergence behavior and whether convergence on the global graph is possible at all. In the original PageRank paper [117], convergence on a graph with 322 million edges was reached in 52 iterations, fewer than one might expect (see also Bianchini et al. [22] for additional discussion).

9 For the interested reader, the proceedings of a workshop series on Adversarial Information Retrieval (AIRWeb) provide great starting points into the literature.

5.4 ISSUES WITH GRAPH PROCESSING

The biggest difference between MapReduce graph algorithms and single-machine graph algorithms is that with the latter, it is usually possible to maintain global data structures in memory for fast, random access. For example, Dijkstra's algorithm uses a global priority queue that guides the expansion of nodes. This, of course, is not possible with MapReduce: the programming model does not provide any built-in mechanism for communicating global state. Since the most natural representation of large sparse graphs is with adjacency lists, communication can only occur from a node to the nodes it links to, or to a node from nodes linked to it; in other words, passing information is only possible within the local graph structure. This restriction gives rise to the structure of many graph algorithms in MapReduce: local computation is performed on each node, the results of which are "passed" to its neighbors. With multiple iterations, information propagates beyond the local graph structure. The passing of partial results along a graph edge is accomplished by the shuffling and sorting provided by the MapReduce execution framework. The amount of intermediate data generated is on the order of the number of edges, which explains why all the algorithms we have discussed assume sparse graphs. For dense graphs, MapReduce running time would be dominated by copying intermediate data across the network, which in the worst case is O(n^2) in the number of nodes in the graph.

10 Of course, it is perfectly reasonable to compute derived graph structures in a pre-processing step. For example, if one wishes to propagate information from a node to all nodes that are within two links, one could process graph G to derive graph G', where there would exist a link from node ni to nj if nj was reachable within two link traversals of ni in the original graph G.

Since MapReduce clusters are designed around commodity networks (e.g., gigabit Ethernet), MapReduce algorithms are often impractical on large, dense graphs.

Combiners and the in-mapper combining pattern described in Section 3.1 can be used to decrease the running time of graph iterations. It is straightforward to use combiners for both parallel breadth-first search and PageRank since Min and sum, used in the two algorithms, respectively, are both associative and commutative. However, combiners are only effective to the extent that there are opportunities for partial aggregation: unless there are nodes pointed to by multiple nodes being processed by an individual map task, combiners are not very useful. This implies that it would be desirable to partition large graphs into smaller components where there are many intra-component links and fewer inter-component links. This way, we can arrange the data such that nodes in the same component are handled by the same map task, thus maximizing opportunities for combiners to perform local aggregation.

Unfortunately, this sometimes creates a chicken-and-egg problem: it would be desirable to partition a large graph to facilitate efficient processing by MapReduce, but the graph may be so large that we can't partition it except with MapReduce algorithms! Fortunately, in many cases there are simple solutions around this problem in the form of "cheap" partitioning heuristics based on reordering the data [106]. For example, in a social network, we might sort nodes representing users by zip code, as opposed to by last name, based on the observation that friends tend to live close to each other. Sorting by an even more cohesive property such as school would be even better (if available): the probability of any two random students from the same school knowing each other is much higher than that of two random students from different schools. Another good example is to partition the web graph by the language of the page (since pages in one language tend to link mostly to other pages in that language) or by domain name (since intra-domain links are typically much denser than inter-domain links). Resorting records using MapReduce is both easy to do and a relatively cheap operation; however, whether the efficiencies gained by this crude form of partitioning are worth the extra time taken in performing the resort is an empirical question that will depend on the actual graph structure and algorithm.

Finally, there is a practical consideration to keep in mind when implementing graph algorithms that estimate probability distributions over nodes (such as PageRank). For large graphs, the probability of any particular node is often so small that it underflows standard floating point representations. A very common solution to this problem is to represent probabilities using their logarithms. When probabilities are stored as logs, the product of two values is simply their sum. However, addition of probabilities is also necessary, for example, when summing PageRank contributions for a node. This can be implemented with reasonable precision as follows:

    a \oplus b = \begin{cases} b + \log(1 + e^{a-b}) & a < b \\ a + \log(1 + e^{b-a}) & a \ge b \end{cases}
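In code, the operation above is only a few lines; the following Python sketch follows the case split directly.

import math

def log_add(a, b):
    """Given log-probabilities a and b, return log(exp(a) + exp(b))."""
    if a < b:
        return b + math.log(1.0 + math.exp(a - b))
    return a + math.log(1.0 + math.exp(b - a))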

Furthermore, many math libraries include a log1p function which computes log(1 + x) with higher precision than the naïve implementation would have when x is very small (as is often the case when working with probabilities). Its use may further improve the accuracy of implementations that use log probabilities.

5.5 SUMMARY AND ADDITIONAL READINGS

This chapter covers graph algorithms in MapReduce, discussing in detail parallel breadth-first search and PageRank. Both are instances of a large class of iterative algorithms that share the following characteristics:

• The graph structure is represented with adjacency lists.

• Algorithms map over nodes and pass partial results to nodes on their adjacency lists. Partial results are aggregated for each node in the reducer.

• The graph structure itself is passed from the mapper to the reducer, such that the output is in the same form as the input.

• Algorithms are iterative and under the control of a non-MapReduce driver program, which checks for termination at the end of each iteration.

The MapReduce programming model does not provide a mechanism to maintain global data structures accessible and mutable by all the mappers and reducers. One implication of this is that communication between pairs of arbitrary nodes is difficult to accomplish. Instead, information typically propagates along graph edges, which gives rise to the structure of algorithms discussed above.

11 However, maintaining globally-synchronized state may be possible with the assistance of other tools (e.g., a distributed database).

Additional Readings. The ubiquity of large graphs translates into substantial interest in scalable graph algorithms using MapReduce in industry, academia, and beyond. There is, of course, much beyond what has been covered in this chapter. For additional material, we refer readers to the following: Kang et al. [80] presented an approach to estimating the diameter of large graphs using MapReduce and a library for graph mining [81]; Cohen [39] discussed a number of algorithms for processing undirected graphs, with social network analysis in mind; Rao and Yarowsky [128] described an implementation of label propagation, a standard algorithm for semi-supervised machine learning, on graphs derived from textual data; Schatz [132] tackled the problem of DNA sequence

alignment and assembly with graph algorithms in MapReduce. Finally, it is easy to forget that parallel graph algorithms have been studied by computer scientists for several decades, particularly in the PRAM model [77, 60]. It is not clear, however, to what extent well-known PRAM algorithms translate naturally into the MapReduce framework.

CHAPTER 6

EM Algorithms for Text Processing

Until the end of the 1980s, text processing systems tended to rely on large numbers of manually written rules to analyze, annotate, and transform text input, usually in a deterministic way. This rule-based approach can be appealing: a system's behavior can generally be understood and predicted precisely, and, when errors surface, they can be corrected by writing new rules or refining old ones. However, rule-based systems suffer from a number of serious problems. They are brittle with respect to the natural variation found in language, and developing systems that can deal with inputs from diverse domains is very labor intensive. Furthermore, when these systems fail, they often do so catastrophically, unable to offer even a "best guess" as to what the desired analysis of the input might be.

In the last 20 years, the rule-based approach has largely been abandoned in favor of more data-driven methods, where the "rules" for processing the input are inferred automatically from large corpora of examples, called training data. The basic strategy of the data-driven approach is to start with a processing algorithm capable of capturing how any instance of the kinds of inputs (e.g., sentences or emails) can relate to any instance of the kinds of outputs that the final system should produce (e.g., the syntactic structure of the sentence or a classification of the email as spam). At this stage, the system can be thought of as having the potential to produce any output for any input, but the outputs are not distinguished in any way. Next, a learning algorithm is applied which refines this process based on the training data, generally attempting to make the model perform as well as possible at predicting the examples in the training data. The learning process, which often involves iterative algorithms, typically consists of activities like ranking rules, instantiating the content of rule templates, or determining parameter settings for a given model. This is known as machine learning, an active area of research.

Data-driven approaches have turned out to have several benefits over rule-based approaches to system development. Since data-driven systems can be trained using examples of the kind that they will eventually be used to process, they tend to deal with the complexities found in real data more robustly than rule-based systems do. Second, developing training data tends to be far less expensive than developing rules. For some applications, significant quantities of training data may even exist for independent reasons (e.g., translations of text into multiple languages are created by authors wishing to reach an audience speaking different languages, not because they are generating training data for a data-driven machine translation system). These advantages come at the cost of systems that often behave internally quite differently than a human-engineered system.

Correcting errors that the trained system makes can also be quite challenging.

Data-driven information processing systems can be constructed using a variety of mathematical techniques, but in this chapter we focus on statistical models, which probabilistically relate inputs from an input set X (e.g., sentences, documents, emails, etc.), which are always observable, to annotations from a set Y, which is the space of possible annotations or analyses that the system should predict. There are three closely related, but distinct challenges in statistical text processing. The first is model selection. This entails selecting a representation of a joint or conditional distribution over the desired X and Y. This model may take the form of either a joint model Pr(x, y), which assigns a probability to every pair ⟨x, y⟩ ∈ X × Y, or a conditional model Pr(y|x), which assigns a probability to every y ∈ Y, given some x. For example, to create a statistical spam detection system, we might have Y = {Spam, NotSpam} and X be the set of all possible email messages; for machine translation, X might be the set of Arabic sentences and Y the set of English sentences. It should be kept in mind that the sets X and Y may still be countably infinite. For a problem where X and Y are very small, one could imagine representing these probabilities in look-up tables. However, for something like email classification or machine translation, where the model space is infinite, the probabilities cannot be represented directly, and must be computed algorithmically.

The second challenge is parameter estimation or learning, which involves the application of an optimization algorithm and training criterion to select the parameters of the model to optimize the model's performance (with respect to the given training criterion) on the training data. In this chapter we will focus on one particularly simple training criterion for parameter estimation, maximum likelihood estimation, which says to select the parameters that make the training data most probable under the model, and one learning algorithm that attempts to meet this criterion, called expectation maximization (EM).

The final challenge for statistical modeling is the problem of decoding, that is, using the model to select an annotation y, given an x ∈ X. One very common strategy is to select y according to the following criterion:

    y^* = \arg\max_{y \in \mathcal{Y}} \Pr(y|x)

1 In this chapter, we will consider discrete models only. They tend to be sufficient for text processing, and their presentation is simpler than models with continuous densities.

2 The parameters of a statistical model are the values used to compute the probability of some event described by the model. We restrict our discussion in this chapter to models with finite numbers of parameters and where the learning process refers to setting those parameters. Inference in and learning of so-called nonparametric models, which have an infinite number of parameters and have become important statistical models for text processing in recent years, is beyond the scope of this chapter.

In a conditional (or direct) model, this is a straightforward search for the best y under the model. In a joint model, the search is also straightforward, on account of the definition of conditional probability:

    y^* = \arg\max_{y \in \mathcal{Y}} \Pr(y|x) = \arg\max_{y \in \mathcal{Y}} \frac{\Pr(x, y)}{\sum_{y'} \Pr(x, y')} = \arg\max_{y \in \mathcal{Y}} \Pr(x, y)

The specific form that the search takes will depend on how the model is represented. Our focus in this chapter will primarily be on the second problem, learning parameters for models, but we will touch on the third problem as well.

Machine learning is often categorized as either supervised or unsupervised. Supervised learning of statistical models simply means that the model parameters are estimated from training data consisting of pairs of inputs and annotations, that is Z = ⟨⟨x1, y1⟩, ⟨x2, y2⟩, ...⟩, where ⟨xi, yi⟩ ∈ X × Y and yi is the gold standard (i.e., correct) annotation of xi. While supervised models often attain quite good performance, they are often uneconomical to use, since the training data requires each object that is to be classified (to pick a specific task), xi, to be paired with its correct label. In many cases, these gold standard training labels must be generated by a process of expert annotation, meaning that each xi must be manually labeled by a trained individual. Even when the annotation task is quite simple for people to carry out (e.g., in the case of spam detection), the number of potential examples that could be classified (representing a subset of X, which may of course be infinite in size) will far exceed the amount of data that can be annotated. As the annotation task becomes more complicated (e.g., when predicting more complex structures such as sequences of labels or when the annotation task requires specialized expertise), annotation becomes far more challenging.

Unsupervised learning, on the other hand, requires only that the training data consist of a representative collection of objects that should be annotated, that is Z = ⟨x1, x2, ...⟩, where xi ∈ X, but without any example annotations. While it may at first seem counterintuitive that meaningful annotations can be learned without any examples of the desired annotations being given, the learning criteria and model structure (which crucially define the space of possible annotations Y and the process by which annotations relate to observable inputs) make it possible to induce annotations by relying on regularities in the unclassified training instances. While a thorough discussion of unsupervised learning is beyond the scope of this book, we focus on a particular class of algorithms, expectation maximization (EM) algorithms, that can be used to learn the parameters of a joint model Pr(x, y) from incomplete data, i.e., data where some of the variables in the model cannot be observed (in the case of unsupervised learning, the yi's are unobserved). Expectation maximization algorithms fit naturally into the MapReduce paradigm, and are used to solve a number of problems of interest in text processing. Furthermore, these algorithms can be quite computationally expensive, since they generally require repeated evaluations of the training data.

MapReduce therefore provides an opportunity not only to scale to larger amounts of data, but also to improve efficiency bottlenecks at scales where non-parallel solutions could be utilized.

This chapter is organized as follows. In Section 6.1, we describe maximum likelihood estimation for statistical models, show how this is generalized to models where not all variables are observable, and then introduce expectation maximization (EM). We describe hidden Markov models (HMMs) in Section 6.2, a very versatile class of models that uses EM for parameter estimation. Section 6.3 discusses how EM algorithms can be expressed in MapReduce, and then in Section 6.4 we look at a case study of word alignment for statistical machine translation. Section 6.5 examines similar algorithms that are appropriate for supervised learning tasks. This chapter concludes with a summary and pointers to additional readings.

6.1 EXPECTATION MAXIMIZATION

Expectation maximization (EM) algorithms [49] are a family of iterative optimization algorithms for learning probability distributions from incomplete data. They are extensively used in statistical natural language processing where one seeks to infer latent linguistic structure from unannotated text. To name just a few applications, EM algorithms have been used to find part-of-speech sequences, constituency and dependency trees, alignments between texts in different languages, alignments between acoustic signals and their transcriptions, as well as for numerous other clustering and structure discovery problems.

Expectation maximization generalizes the principle of maximum likelihood estimation to the case where the values of some variables are unobserved (specifically, those characterizing the latent structure that is sought).

6.1.1 MAXIMUM LIKELIHOOD ESTIMATION

Maximum likelihood estimation (MLE) is a criterion for fitting the parameters θ of a statistical model to some given data x. Specifically, it says to select the parameter settings θ* such that the likelihood of observing the training data given the model is maximized:

    \theta^* = \arg\max_{\theta} \Pr(X = x; \theta)    (6.1)

To illustrate, consider the simple marble game shown in Figure 6.1. In this game, a marble is released at the position indicated by the black dot, and it bounces down into one of the cups at the bottom of the board, being diverted to the left or right by the peg (indicated by a triangle) in the center. Our task is to construct a model that predicts which cup the ball will drop into. A "rule-based" approach might be to take exact measurements of the board and construct a physical model that we can use to predict the behavior of the ball.

Figure 6.1: A simple marble game where a released marble takes one of two possible paths (labeled 0 and 1) into one of two cups (labeled a and b). This game can be modeled using a Bernoulli random variable with parameter p.

Given sophisticated enough measurements, this could certainly lead to a very accurate model. However, the construction of this model would be quite time consuming and difficult.

A statistical approach, on the other hand, might be to assume that the behavior of the marble in this game can be modeled using a Bernoulli random variable Y with parameter p, which indicates the probability that the marble will go to the right when it hits the peg; that is, the value of the random variable indicates whether path 0 or 1 is taken. We also define a random variable X whose value is the label of the cup that the marble ends up in; note that X is deterministically related to Y, so an observation of X is equivalent to an observation of Y.

To estimate the parameter p of the statistical model of our game, we need some training data, so we drop 10 marbles into the game, which end up in cups x = ⟨b, b, b, a, b, b, b, b, b, a⟩.

What is the maximum likelihood estimate of p given this data? By assuming that our samples are independent and identically distributed (i.i.d.), we can write the likelihood of our data as follows:

    \Pr(x; p) = \prod_{j=1}^{10} p^{\delta(x_j, a)} (1 - p)^{\delta(x_j, b)} = p^2 \cdot (1 - p)^8

3 In this equation, δ is the Kronecker delta function, which evaluates to 1 where its arguments are equal and 0 otherwise.

Since log is a monotonically increasing function, maximizing log Pr(x; p) will give us the desired result. We can do this by differentiating with respect to p and finding where the resulting expression equals 0:

    \frac{d \log \Pr(x; p)}{dp} = \frac{d\,[2 \log p + 8 \log(1 - p)]}{dp} = \frac{2}{p} - \frac{8}{1 - p} = 0

Solving for p yields 0.2, which is the intuitive result. Furthermore, it is straightforward to show that in N trials where N0 marbles followed path 0 to cup a, and N1 marbles followed path 1 to cup b, the maximum likelihood estimate of p is N1/(N0 + N1).

While this model only makes use of an approximation of the true physical process at work when the marble interacts with the game board, it is an empirical question whether the model works well enough in practice to be useful. Additionally, while a Bernoulli trial is an extreme approximation of the physical process, if insufficient resources were invested in building a physical model, the approximation may perform better than the more complicated "rule-based" model. This sort of dynamic is found often in text processing problems: given enough data, astonishingly simple models can outperform complex knowledge-intensive models that attempt to simulate complicated processes.

6.1.2 A LATENT VARIABLE MARBLE GAME

To see where latent variables might come into play in modeling, consider a more complicated variant of our marble game shown in Figure 6.2. This version consists of three pegs that influence the marble's path, and the marble may end up in one of three cups. Note that both paths 1 and 2 lead to cup b.

To construct a statistical model of this game, we again assume that the behavior of a marble interacting with a peg can be modeled with a Bernoulli random variable. Since there are three pegs, we have three random variables with parameters θ = ⟨p0, p1, p2⟩, corresponding to the probabilities that the marble will go to the right at the top, left, and right pegs, respectively. We further define a random variable X taking on values from {a, b, c}, indicating what cup the marble ends in, and Y, taking on values from {0, 1, 2, 3}, indicating which path was taken. Note that the full joint distribution Pr(X = x, Y = y) is determined by θ.

How should the parameters θ be estimated? If it were possible to observe the paths taken by marbles as they were dropped into the game, it would be trivial to estimate the parameters for our model using the maximum likelihood estimator: we would simply need to count the number of times the marble bounced left or right at each peg. If Nx counts the number of times a marble took path x in N trials, this is:


Figure 6.2: A more complicated marble game where the released marble takes one of four possible paths. We assume that we can only observe which cup the marble ends up in, not the specific path taken.

    p_0 = \frac{N_2 + N_3}{N} \qquad p_1 = \frac{N_1}{N_0 + N_1} \qquad p_2 = \frac{N_3}{N_2 + N_3}
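If every marble's path were observed, these estimators reduce to simple counting. A small Python sketch follows; the path counts below are made-up values for illustration.

# Hypothetical fully observed path counts N0..N3 (illustrative values only).
counts = {0: 4, 1: 6, 2: 3, 3: 7}
N = sum(counts.values())

p0 = (counts[2] + counts[3]) / N           # went right at the top peg
p1 = counts[1] / (counts[0] + counts[1])   # went right at the left peg
p2 = counts[3] / (counts[2] + counts[3])   # went right at the right peg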

However, we wish to consider the case where the paths taken are unobservable (imagine an opaque sheet covering the center of the game board), but where we can see what cup a marble ends in. In other words, we want to consider the case where we have partial data. This is exactly the problem encountered in unsupervised learning: there is a statistical model describing the relationship between two sets of variables (the X's and the Y's), and there is data available from just one of them. Furthermore, such algorithms are quite useful in text processing, where latent variables may describe latent linguistic structures of the observed variables, such as parse trees or part-of-speech tags, or alignment structures relating sets of observed variables (see Section 6.4).

6.1.3 MLE WITH LATENT VARIABLES

Formally, we consider the problem of estimating parameters for statistical models of the form Pr(X, Y; θ) which describe not only an observable variable X but a latent, or hidden, variable Y. In these models, since only the values of the random variable X are observable, we define our optimization criterion to be the maximization of the marginal likelihood, that is, summing over all settings of the latent variable Y, which takes on values from the set designated Y:
4 For this description, we assume that the variables in our model take on discrete values. Not only does this simplify exposition, but discrete models are widely used in text processing.


    \Pr(X = x) = \sum_{y \in \mathcal{Y}} \Pr(X = x, Y = y; \theta)

For a vector of training observations x = ⟨x1, x2, ..., x|x|⟩, we again assume that the samples in the training data are i.i.d.:

    \Pr(x; \theta) = \prod_{j=1}^{|x|} \sum_{y \in \mathcal{Y}} \Pr(X = x_j, Y = y; \theta)

Thus, the maximum (marginal) likelihood estimate of the model parameters θ* given a vector of i.i.d. observations x becomes:

    \theta^* = \arg\max_{\theta} \prod_{j=1}^{|x|} \sum_{y \in \mathcal{Y}} \Pr(X = x_j, Y = y; \theta)
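For the marble game the inner sum over Y has only four terms, so the marginal likelihood can be written out directly. A Python sketch follows; the parameter values and observation vector in the last line are illustrative.

def path_prob(y, p0, p1, p2):
    """Joint probability of taking path y in the three-peg game."""
    return {0: (1 - p0) * (1 - p1),
            1: (1 - p0) * p1,
            2: p0 * (1 - p2),
            3: p0 * p2}[y]

CUP_OF_PATH = {0: 'a', 1: 'b', 2: 'b', 3: 'c'}

def marginal_likelihood(x, p0, p1, p2):
    """Product over observations of the sum over latent paths."""
    total = 1.0
    for cup in x:
        total *= sum(path_prob(y, p0, p1, p2)
                     for y in range(4) if CUP_OF_PATH[y] == cup)
    return total

print(marginal_likelihood(['a', 'b', 'b', 'c'], 0.5, 0.5, 0.5))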

Unfortunately, in many cases, this maximum cannot be computed analytically, but the iterative hill-climbing approach of expectation maximization can be used instead.

6.1.4 EXPECTATION MAXIMIZATION

Expectation maximization (EM) is an iterative algorithm that finds a successive series of parameter estimates θ(0), θ(1), ... that improve the marginal likelihood of the training data. That is, EM guarantees:

    \prod_{j=1}^{|x|} \sum_{y \in \mathcal{Y}} \Pr(X = x_j, Y = y; \theta^{(i+1)}) \;\ge\; \prod_{j=1}^{|x|} \sum_{y \in \mathcal{Y}} \Pr(X = x_j, Y = y; \theta^{(i)})

The algorithm starts with some initial set of parameters θ(0) and then updates them using two steps: expectation (E-step), which computes the posterior distribution over the latent variables given the observable data x and a set of parameters θ(i), and maximization (M-step), which computes new parameters θ(i+1) maximizing the expected log likelihood of the joint distribution with respect to the distribution computed in the E-step. The process then repeats with these new parameters. The algorithm terminates when the likelihood remains unchanged. In more detail, the steps are as follows:

5 The term "expectation" is used since the values computed in terms of the posterior distribution Pr(y|x; θ(i)) that are required to solve the M-step have the form of an expectation (with respect to this distribution).

6 The final solution is only guaranteed to be a local maximum, but if the model is fully convex, it will also be the global maximum.


E-step. Compute the posterior probability of each possible hidden variable assignment y ∈ Y for each x ∈ X under the current parameter settings, weighted by the relative frequency with which x occurs in x. Call this q(X = x, Y = y; θ(i)) and note that it defines a joint probability distribution over X × Y, in that \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} q(x, y) = 1:

    q(x, y; \theta^{(i)}) = f(x|\mathbf{x}) \cdot \Pr(Y = y | X = x; \theta^{(i)}) = f(x|\mathbf{x}) \cdot \frac{\Pr(x, y; \theta^{(i)})}{\sum_{y'} \Pr(x, y'; \theta^{(i)})}

M-step. Compute new parameter settings that maximize the expected log of the probability of the joint distribution under the q-distribution that was computed in the E-step:

    \theta^{(i+1)} = \arg\max_{\theta'} \; \mathbb{E}_{q(X = x, Y = y; \theta^{(i)})} \left[ \log \Pr(X = x, Y = y; \theta') \right]
                  = \arg\max_{\theta'} \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} q(X = x, Y = y; \theta^{(i)}) \cdot \log \Pr(X = x, Y = y; \theta')

We omit the proof that the model with parameters θ(i+1) will have equal or greater marginal likelihood on the training data than the model with parameters θ(i), but this is provably true [78]. Before continuing, we note that the effective application of expectation maximization requires that both the E-step and the M-step consist of tractable computations. Specifically, summing over the space of hidden variable assignments must not be intractable. Depending on the independence assumptions made in the model, this may be achieved through techniques such as dynamic programming. However, some models may require intractable computations.

6.1.5 AN EM EXAMPLE

Let's look at how to estimate the parameters of our latent variable marble game from Section 6.1.2 using EM. We assume training data x consisting of N = |x| observations of X, with Na, Nb, and Nc indicating the number of marbles ending in cups a, b, and c. We start with some parameters θ(0) = ⟨p0(0), p1(0), p2(0)⟩ that have been randomly initialized to values between 0 and 1.

E-step. We need to compute the distribution q(X = x, Y = y; θ(i)), as defined above. We first note that the relative frequency f(x|x) is:

    f(x|\mathbf{x}) = \frac{N_x}{N}

Next, we observe that Pr(Y = 0 | X = a) = 1 and Pr(Y = 3 | X = c) = 1, since cups a and c fully determine the value of the path variable Y. The posterior probabilities of paths 1 and 2 are only non-zero when X is b:

    \Pr(1|b; \theta^{(i)}) = \frac{(1 - p_0) p_1}{(1 - p_0) p_1 + p_0 (1 - p_2)} \qquad \Pr(2|b; \theta^{(i)}) = \frac{p_0 (1 - p_2)}{(1 - p_0) p_1 + p_0 (1 - p_2)}

Except for the four cases just described, Pr(Y = y | X = x) is zero for all other values of x and y (regardless of the value of the parameters).

M-step. We now need to maximize the expectation of log Pr(X, Y; θ') (which will be a function in terms of the three parameter variables) under the q-distribution we computed in the E-step. The non-zero terms in the expectation are as follows:

    x    y    q(X = x, Y = y; θ(i))             log Pr(X = x, Y = y; θ')
    a    0    Na/N                              log(1 − p0) + log(1 − p1)
    b    1    Nb/N · Pr(1|b; θ(i))              log(1 − p0) + log p1
    b    2    Nb/N · Pr(2|b; θ(i))              log p0 + log(1 − p2)
    c    3    Nc/N                              log p0 + log p2

Multiplying across each row and adding from top to bottom yields the expectation we wish to maximize. Each parameter can be optimized independently using differentiation. The resulting optimal values are expressed in terms of the counts in x and θ(i):

    p_0 = \frac{\Pr(2|b; \theta^{(i)}) N_b + N_c}{N} \qquad p_1 = \frac{\Pr(1|b; \theta^{(i)}) N_b}{N_a + \Pr(1|b; \theta^{(i)}) N_b} \qquad p_2 = \frac{N_c}{\Pr(2|b; \theta^{(i)}) N_b + N_c}

It is worth noting that the form of these expressions is quite similar to the fully observed maximum likelihood estimates; however, rather than depending on exact path counts, the statistics used are the expected path counts, given x and parameters θ(i).

Typically, the values computed at the end of the M-step would serve as new parameters for another iteration of EM. For most models, EM requires several iterations to converge, and it may not find a global optimum: since EM only finds a locally optimal solution, the final parameter values depend on the values chosen for θ(0). However, the example we have presented here is quite simple, and the model converges to a global optimum after a single iteration.
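A compact Python sketch of this procedure, using the closed-form E-step and M-step expressions above (the cup counts passed in at the bottom are illustrative values):

import random

def em_marble_game(Na, Nb, Nc, iterations=10, seed=0):
    """EM for the latent three-peg marble game."""
    random.seed(seed)
    N = Na + Nb + Nc
    p0, p1, p2 = (random.random() for _ in range(3))      # theta^(0)
    for _ in range(iterations):
        # E-step: posterior over the two paths that can end in cup b.
        z = (1 - p0) * p1 + p0 * (1 - p2)
        pr1_b = (1 - p0) * p1 / z                          # Pr(1 | b; theta^(i))
        pr2_b = p0 * (1 - p2) / z                          # Pr(2 | b; theta^(i))
        # M-step: re-estimate the parameters from expected path counts.
        p0 = (pr2_b * Nb + Nc) / N
        p1 = (pr1_b * Nb) / (Na + pr1_b * Nb)
        p2 = Nc / (pr2_b * Nb + Nc)
    return p0, p1, p2

print(em_marble_game(Na=30, Nb=50, Nc=20))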

e. at each time step. As another point of comparison. base pair. and s refer to states in S.). Bq (o) ≥ 0. the PageRank algorithm considered in the previous chapter (Section 5. as well as that: Aq (r) = 1 ∀q Bq (o) = 1 ∀q πq = 1 q∈S r∈S o∈O A sequence of observations of length τ is generated as follows: Step 0. or letter) is emitted according to a probability distribution conditioned on the identity of the state that the underlying process is in. and o refers to symbols in the observation vocabulary O. we assume that variables q. a word. and o. part of speech tagging [44].122 CHAPTER 6. θ .7 These matrices may be dense. These simple but powerful models have been used in applications as diverse as speech recognition [78]. text retrieval [108]. . but it is one that is useful for many text processing problems. not directly observable (i. Following convention. The states of this Markov process are..3) can be understood as a Markov process: the probability of following any link on a particular page is independent of the path taken to reach that page.g. where Aq (r) gives the probability of transitioning from state q to state r. let t = 1 and select an initial state q according to the distribution π. O. which is a stochastic process consisting of a finite set of states where the probability of entering a state at time t + 1 depends only on the state of the process at time t [130]. base pairs in a gene. an |S| × |O| matrix B of emission probabilities. or continuous-valued observations may be emitted (for example. hidden). but for many applications sparse parameterizations are useful. 7 This is only one possible definition of an HMM. from left to right. where Bq (o) gives the probability that symbol o will be emitted from state q. an observable token (e. which generate symbols from a finite observation vocabulary O.. and an |S|-dimensional vector π. and πq ≥ 0 for all q. one can view a Markov process as a probabilistic variant of a finite state machine. where πq is the probability that the process starts in state q. the data being modeled is posited to have been generated from an underlying Markov process. initial and final states may be handled differently. EM ALGORITHMS FOR TEXT PROCESSING data that are ordered sequentially (temporally. r. etc. such as words in a sentence. Instead. and word alignment of parallel (translated) texts [150] (more in Section 6. or letters in a word. however. from a Gaussian distribution). an observation symbol from O is emitted according to the distribution Bq . B. r. In alternative definitions. stock market forecasting [70]. A hidden Markov model M is defined as a tuple S. gene finding [143]. S is a finite set of states. In an HMM. This model is parameterized by the tuple θ = A. Step 1. information extraction [139]. π consisting of an |S| × |S| matrix A of transition probabilities. observations may be emitted during the transition between states. Alternatively.4). where transitions are taken probabilistically. We further stipulate that Aq (r) ≥ 0.

Step 2: a new state q is drawn according to the distribution Aq.

Step 3: t is incremented, and if t ≤ τ, the process repeats from Step 1.

Since all events generated by this process are conditionally independent, the joint probability of a sequence of observations and the state sequence used to generate it is the product of the individual event probabilities.

Figure 6.3 shows an example of a hidden Markov model for part-of-speech tagging. States S = {det, adj, nn, v} correspond to the parts of speech (determiner, adjective, noun, and verb), and observations O = {the, a, green, ...} are a subset of English words. This example illustrates a key intuition behind many applications of HMMs: states correspond to equivalence classes or clusterings of observations, and a single observation type may be associated with several clusters (in this example, the word wash can be generated by an nn or a v, since wash can either be a noun or a verb).

6.2.1 THREE QUESTIONS FOR HIDDEN MARKOV MODELS

There are three fundamental questions associated with hidden Markov models:

1. Given a model M = ⟨S, O, θ⟩ and an observation sequence x, what is the probability that M generated the data (summing over all possible state sequences Y)?

       \Pr(x) = \sum_{y \in \mathcal{Y}} \Pr(x, y; \theta)

2. Given a model M = ⟨S, O, θ⟩ and an observation sequence x, what is the most likely sequence of states that generated the data?

       y^* = \arg\max_{y \in \mathcal{Y}} \Pr(x, y; \theta)

3. Given a set of states S, an observation vocabulary O, and a series of i.i.d. observation sequences x1, x2, ..., xℓ, what are the parameters θ = ⟨A, B, π⟩ that maximize the likelihood of the training data?

       \theta^* = \arg\max_{\theta} \prod_{i=1}^{\ell} \sum_{y \in \mathcal{Y}} \Pr(x_i, y; \theta)

Using our definition of an HMM, the answers to the first two questions are in principle quite trivial to compute: by iterating over all state sequences in Y, the probability of x under each state sequence can be computed by looking up and multiplying the relevant probabilities in A, B, and π, and then summing the results or taking the maximum.

8 The organization of this section is based in part on ideas from Lawrence Rabiner's HMM tutorial [125].

Figure 6.3: An example HMM that relates part-of-speech tags to vocabulary items in an English-like language. Possible (probability > 0) transitions for the Markov process are shown graphically. In the example outputs, the state sequences corresponding to the emissions are written beneath the emitted symbols. (The figure lists initial, transition, and emission probabilities for the states DET, ADJ, NN, and V, together with example outputs such as "John might wash" tagged NN V V.)
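For the examples that follow it is convenient to have a small HMM in hand. The Python sketch below is in the spirit of Figure 6.3, but the probability values are placeholders chosen for illustration, not the exact numbers from the figure.

# A tiny part-of-speech HMM in the spirit of Figure 6.3; the probabilities are
# illustrative placeholders, not the figure's exact values.
STATES = ['DET', 'ADJ', 'NN', 'V']

PI = {'DET': 0.4, 'ADJ': 0.1, 'NN': 0.3, 'V': 0.2}              # pi_q

A = {'DET': {'DET': 0.0, 'ADJ': 0.4, 'NN': 0.5, 'V': 0.1},      # A_q(r)
     'ADJ': {'DET': 0.0, 'ADJ': 0.3, 'NN': 0.6, 'V': 0.1},
     'NN':  {'DET': 0.1, 'ADJ': 0.1, 'NN': 0.3, 'V': 0.5},
     'V':   {'DET': 0.3, 'ADJ': 0.2, 'NN': 0.3, 'V': 0.2}}

B = {'DET': {'the': 0.7, 'a': 0.3},                             # B_q(o)
     'ADJ': {'big': 0.4, 'green': 0.3, 'old': 0.3},
     'NN':  {'John': 0.3, 'person': 0.2, 'plants': 0.2, 'wash': 0.3},
     'V':   {'might': 0.3, 'wash': 0.3, 'loves': 0.2, 'washes': 0.2}}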

Unfortunately, we will quickly run into trouble if we try to use this naïve strategy, since there are |S|^τ distinct state sequences of length τ, making exhaustive enumeration computationally intractable, even with all the distributed computing power MapReduce makes available. Fortunately, because the underlying model behaves exactly the same whenever it is in some state, regardless of how it got to that state, we can use dynamic programming algorithms to answer all of the above questions without summing over exponentially many sequences. And, as we hinted at in the previous section, the third question can be answered using EM.

6.2.2 THE FORWARD ALGORITHM

Given some observation sequence, for example x = ⟨John, might, wash⟩, Question 1 asks what is the probability that this sequence was generated by an HMM M = ⟨S, O, θ⟩. For the purposes of illustration, we assume that M is defined as shown in Figure 6.3. There are two ways to compute the probability of x having been generated by M. The first is to compute the sum over the joint probability of x and every possible labeling y ∈ {⟨det, det, det⟩, ⟨det, det, nn⟩, ⟨det, det, v⟩, ...}. As indicated above, this is not feasible for most sequences, since the set of possible labels is exponential in the length of x. The second, called the forward algorithm, is much more efficient.

This algorithm works by recursively computing the answer to a related question: what is the probability that the process is in state q at time t and has generated x1, x2, ..., xt? Call this probability αt(q). Conceptually, αt(q) is a two-dimensional matrix (of size |x| × |S|), called a trellis. It is easy to see that the values of α1(q) can be computed as the product of two independent probabilities: the probability of starting in state q and the probability of state q generating x1:

    \alpha_1(q) = \pi_q \cdot B_q(x_1)

From this, it's not hard to see that the values of α2(r) for every r can be computed in terms of the |S| values in α1(·) and the observation x2:

    \alpha_2(r) = B_r(x_2) \cdot \sum_{q \in S} \alpha_1(q) \cdot A_q(r)

This works because there are |S| different ways to get to state r at time t = 2: starting from state 1, 2, ..., |S| and transitioning to state r. Because the behavior of a Markov process is determined only by the state it is in at some time (not by how it got to that state), αt(r) can always be computed in terms of the |S| values in αt−1(·) and the observation xt:

    \alpha_t(r) = B_r(x_t) \cdot \sum_{q \in S} \alpha_{t-1}(q) \cdot A_q(r)

We have now shown how to compute the probability of being in any state q at any time t, having generated x1, x2, ..., xt, with the forward algorithm. The probability of the full sequence is the probability of being at time |x| in any state, so the answer to Question 1 can be computed simply by summing over the α values at time |x| for all states:

    \Pr(x; \theta) = \sum_{q \in S} \alpha_{|x|}(q)
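The recurrences above translate almost line for line into Python. The sketch below reuses the illustrative STATES, PI, A, and B defined after Figure 6.3, so its numbers will not match Figure 6.4 exactly.

def forward(x, states, pi, A, B):
    """alpha[i][q] = probability of generating x[0..i] and being in state q at step i."""
    alpha = [{q: pi[q] * B[q].get(x[0], 0.0) for q in states}]
    for t in range(1, len(x)):
        alpha.append({r: B[r].get(x[t], 0.0) *
                         sum(alpha[t - 1][q] * A[q][r] for q in states)
                      for r in states})
    return alpha

alpha = forward(['John', 'might', 'wash'], STATES, PI, A, B)
print(sum(alpha[-1][q] for q in STATES))      # Pr(x; theta), the answer to Question 1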

In summary, there are two ways of computing the probability that a sequence of observations x was generated by M: exhaustive enumeration with summing, and the forward algorithm. Figure 6.4 illustrates the two possibilities. The upper panel shows the naïve exhaustive approach, enumerating all 4^3 possible labels y of x and computing their joint probability Pr(x, y). Summing over all y, the marginal probability of x is found to be 0.00018. The lower panel shows the forward trellis, consisting of 4 × 3 cells. Summing over the final column also yields 0.00018, the same result.

6.2.3 THE VITERBI ALGORITHM

Given an observation sequence x, the second question we might want to ask of M is: what is the most likely sequence of states that generated the observations? As with the previous question, the naïve approach to solving this problem is to enumerate all possible labels and find the one with the highest joint probability. Continuing with the example observation sequence x = ⟨John, might, wash⟩, examining the chart of probabilities in the upper panel of Figure 6.4 shows that y* = ⟨nn, v, v⟩ is the most likely sequence of states under our example HMM. However, a more efficient answer to Question 2 can be computed using the same intuition as in the forward algorithm: determine the best state sequence for a short sequence and extend this to easily compute the best sequence for longer ones. This is known as the Viterbi algorithm. We define γt(q), the Viterbi probability, to be the probability of the most probable sequence of states ending in state q at time t and generating observations x1, x2, ..., xt. Since we wish to be able to reconstruct the sequence of states, we define bpt(q), the "backpointer", to be the state used in this sequence at time t − 1. The base case for the recursion is as follows (the state index of −1 is used as a placeholder since there is no previous best state at time t = 1):
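A Python sketch of the Viterbi recursion follows; the base case mirrors α1(q) = πq · Bq(x1) (with no predecessor recorded at t = 1), and the update is the same as the forward recurrence with max in place of sum. It reuses the illustrative model parameters above, so its output will not necessarily match the figure.

def viterbi(x, states, pi, A, B):
    """gamma[t][q]: probability of the best state sequence ending in q at step t;
    bp[t][q]: the state used at step t - 1 on that sequence."""
    gamma = [{q: pi[q] * B[q].get(x[0], 0.0) for q in states}]
    bp = [{q: None for q in states}]                      # no predecessor at t = 1
    for t in range(1, len(x)):
        gamma.append({})
        bp.append({})
        for r in states:
            best = max(states, key=lambda q: gamma[t - 1][q] * A[q][r])
            gamma[t][r] = B[r].get(x[t], 0.0) * gamma[t - 1][best] * A[best][r]
            bp[t][r] = best
    last = max(states, key=lambda q: gamma[-1][q])        # best final state
    path = [last]
    for t in range(len(x) - 1, 0, -1):                    # follow backpointers
        path.append(bp[t][path[-1]])
    return list(reversed(path)), gamma[-1][last]

print(viterbi(['John', 'might', 'wash'], STATES, PI, A, B))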

Figure 6.4: Computing the probability of the sequence ⟨John, might, wash⟩ under the HMM given in Figure 6.3, by explicitly summing over all possible sequence labels (upper panel) and by using the forward algorithm (lower panel). Both approaches yield the marginal probability 0.00018.
θ) wash the most HMM probability of y∈Y = arg max Pr(x.observation observation sequence the what is the most S. O. nn.0 = S. DET probabilities. NN a model 0.0 conditionally with hidden joint models:7ADJ sequence sequence of DET V 0. statesto HMMs: of HMMs: states to equivalence classes correspond correspond to intuition behind and a single observation type may associated with several many applicationsy∈Y HMMs: states correspond to equivalence classes of or clusteringor clustering ofclustering of observations.0 event probabilities. θ) 0. might. . Given. Vto observations observationsV O = 0. .0 0. O. green. DET state initial state to the distribution distribution ADJobservation observation symbol from O state iinitial ADJ i according i according to0. theNN ADJ repeats Since Vall 1. v} to speech.000009 V There are process are conditionally independent.. let t = 0 and select an initial state i according to the distribution π.0 processthis process in are conditionally the 0.arg max∗ y. . what Given our definition of an HMM. an observationseriesand a i.Figure 2.and an vocabularythe data. EMthe English words. select = or clustering of observationsobservationsis generated asgenerated 0 follows. v} correspond correspond of speech. and an0. B.an . the joint probability of this NN DET of 0. .0 events used DET 3. associated t Stepandwash 0 A length Step 0.4. summingspeech tagging. This exampleaillustrates a key {the. of statessetGivenobservationpanel). the process repeats from Step 1. clusters. . θ) are the. Given a set Given a 3. 0. V of over all nn.incremented. . if t < τ . observation sequencesmax. 0. all x1sequences. equivalence classes a Pr(x) of the correspond This a key intuition behind many applications applicationsapplicationsθ) = 0. .0 event probabilities. x . a model observation sequencean observation sequence from O.3 shows example of States S = {det. v} correspond to the parts of speech.0 V ADJ example ADJ 0. Since all 0. τ .shows what shows probability that hidden ADJ for partdata. x.0 ADJ x = x3. most 2. Pr(xi what Pr(xi y. x2 . the example illustrates observations O = {the.0 individual DET NN ADJ 0. likely sequence of states that generated the data? John might wash 2. associated Markov models: There are three fundamental questions associated with hidden Markov models:7 y∈Y 1.0 of event event M probabilities. . . 0. .0 y∗ = arg max0. the probability probability state 7 The 7 The organization based section on this in part on ideas Rabiner’s ideas Rabiner’s HMM tutorial HMM 7 The this in part of ideas from is based from on HMM tutorial [64]. θ and an observation sequence x. y. v} adj. .} are a subset of a subset of the English words.i. anmax∗Pr(x. the answers to the first two questions are in principle in principle Given our an HMM.the joint probability of this sequence Vof ADJ used repeats ADJ from Step0. θ) Pr(xi y. πθ = A. . y. B. Given andS.0τ 0. a new i NN 2. Given a 1. τ as follows.0 .0 Step 1. θ) Pr(x) = θ) Pr(x.0 in are ADJ NNfundamental are ADJ independent.0 three associated NN DET DET 0. .0 1. aM generatedi.0 ADJ NN NN NN NN observationsNN and the state sequenceNN used 0. 2. Y? generated over all possible statesequences . θ) the relevant probabilities 2. Figure 2. O. States S = {det.=a 1 model the S. of length τ ADET according to the distribution π. adj.0 Step 1. . . . of = arg observation sequences xobservation x. to ADJ ADJ repeats 0. = S. if questionsNN process independent. M1=x2S. the probability that each ∗ generated x can be computed arg O. parameters = A. 
.0 ADJ V DET 0.3 by explicitly summing over all possible sequence labels (upper panel) and 3.0 NN ADJ V the 0.. Given observation sequence 0.} are {the. observation y∗ = and Figure Given Computing the S. .1 are three QUESTIONSquestions associated with hiddenwith7hidden Markov models:7 THREE fundamental FOR HIDDEN MARKOV MODELS 7 There are three fundamentalDET three0.y)and select generated Aobservations sequence of isof might wash ofp(x.0 observations ADJ the state 0. ADJtFOR.0 DET ADJ V an initial state idistribution Bthe distribution π. .0 emitted DET t <0. θ) Pr(x) = Pr(x.0 to it the 0. adj. and a i. = arg max . clusters. O. . the that each that each quite trivial by compute: to compute: over θall i=1 y∈Ythe probability that each to iterating by iterating sequences Y. data. a.1 THREE QUESTIONS FOR HIDDEN 2. . t is incremented.0 .1. Step 0.0 the NN and the partsand speech. πsumming over x1 sequences. Y? maximize Pr(x) = Pr(x. B. if t < . the joint probability of this sequence of observations and the state sequence used to generate it is the product of the individual event probabilities. adj. π.1 THREE QUESTIONS FOR HIDDEN MARKOV MARKOV 2. = 0 several observations. S.y) p(x.d. t is ADJ ADJ ADJ ADJ 2. i is drawn according to MODELS 127 new HIDDEN MARKOV Ai .0 =anADJ to the Vparts0. possible. . a.d.what.are x1. v} correspond toThis parts of speech. and a single observation type may associated associated or clustering of observations.0 anπ. t is incremented. Since Step 0.0 questions associated with hidden y.0 theused to ADJ product ofV product of and NN usedit to it the 0.0 symbols from O.0 V DET 0. what . xτ setx2 . = xx what observationis generated generated series . 0. the what the probability that the data.i. . B. and summingof summing .. O. 0.1DET 3. Given a model M = S. .0 for V x Vx1 xADJ x DET =Figure2 .y) sequencewashof length τJohn length τ is follows. y. select 0 0. .0 ADJ DET Vthe 0.0 NN DET DET an initial clusters.0 process the process repeats fromADJ 0.0 accordingStep 2. the 1 2 x x = all . ..} are of subset = statesEnglish words. . an observation symbol from O is emitted according to the distribution Bi .0 a new is 0.0 observations and the state sequenceADJ sequence NN NN generate it is thethe NN ADJ 0.} are a ALGORITHMS PROCESSINGThis example illustrates a key 38 38 2. of.00009 States V = {det. 0. Given a 2. the answers to the first two questions are in principle q∈S likely sequence of states likelygenerated of generated the data? the data? that sequence the data? likely sequence of states that states that generated quite trivial to compute: by iterating over all state sequences Y. NN over allthe likelihood of the training data? possible state sequences.d. This example illustrates key {the. DET DET V 0. the joint probability probability ofV this 0. θ∗ = arg x1θ x2 . CHAPTER CHAPTER FOR TEXT FOR TEXT FOR TEXT PROCESSING a.sequences Y.0 Vi isDET ADJ according thethe distribution Bi . and a single observation type may associated with several clusters. Step V 0.0 y∈Y Pr(x) =0.0 Pr(x. Given a modelPr(x) = an αθ (q) = 0.0are the M 0. DET. green.M. according to symbol ADJ DET is emitted according toaccording to the distribution distribution Binew i is 0.0 V DET 0.sequences. the = observation sequences 1 x2 . x1. θ) θ maximize the likelihood ofi=1 y∈Y θ i=1 y∈Y θ i=1 y∈Y the training data? Given our definition of definition ofdefinition of ananswers the the first twothe first two questions are in principle Given our an HMM. of states S. an an 0. possible0.0 0.correspond adj. 
Since all events used in this process are conditionally independent.} are the English words.NN Markov of this eventsNN of 0. a. .an 2.4. M = an θ and O. organization of this section isof organization is basedsection Lawrencein partLawrence from Lawrence Rabiner’s [64]. ∗ = . Step conditionally independent.0 S States state{det.3 shows. is. green. V Figure 2.2. xstates . a 6. π parameters that 1 maximize the likelihood maximize the likelihood of the training data? of the training data? maximize the likelihood of the training data? 3.NN DET an symbol an symbol from O 0.4. EM ALGORITHMS FOR states PROCESSING intuition behind many applications of HMMs: TEXT correspond to equivalence classes p(x.0 fromwith and DET DET 0. . NN drawn NN observation A . an a (lower ∗ vocabulary vocabulary O. M = an observation sequencean sequence x. to answers to questions are θ∗ = arg max Pr(x .000099 y∈Y θ∗ = arg max y∈Y Pr(xi . y.0 if < ADJ ADJ 0. π that .
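To make the recurrence concrete, the following is a minimal Python sketch of the forward algorithm (illustrative only, not from the original text); the dictionaries A, B, and pi, holding the transition, emission, and initial-state probabilities of an HMM such as the one in Figure 6.3, are assumed to be supplied by the caller.

```python
def forward(x, states, A, B, pi):
    """Compute forward probabilities alpha[t][q] for observation sequence x.

    A[q][r]: transition probability q -> r
    B[q][o]: probability that state q emits symbol o
    pi[q]:   probability of starting in state q
    """
    alpha = [{} for _ in x]
    for q in states:                       # base case: alpha_1(q) = pi_q * B_q(x_1)
        alpha[0][q] = pi[q] * B[q].get(x[0], 0.0)
    for t in range(1, len(x)):             # recursion over time steps
        for r in states:
            alpha[t][r] = B[r].get(x[t], 0.0) * sum(
                alpha[t - 1][q] * A[q][r] for q in states)
    return alpha

def sequence_probability(x, states, A, B, pi):
    # Answer to Question 1: Pr(x) = sum over all states q of alpha_|x|(q)
    alpha = forward(x, states, A, B, pi)
    return sum(alpha[-1].values())
```

Run on the example sequence ⟨John, might, wash⟩ with the example model, sequence_probability should reproduce the 0.00018 figure computed above.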

6.2.3 THE VITERBI ALGORITHM
Given an observation sequence x, the second question we might want to ask of M is: what is the most likely sequence of states that generated the observations? As with the previous question, the naïve approach to solving this problem is to enumerate all possible labels and find the one with the highest joint probability. For example, examining the chart of probabilities in the upper panel of Figure 6.4 shows that y∗ = ⟨nn, v, v⟩ is the most likely sequence of states for the sequence ⟨John, might, wash⟩ under our example HMM. However, a more efficient answer to Question 2 can be computed using the same intuition used in the forward algorithm: determine the best state sequence for a short sequence and extend this to easily compute the best sequence for longer ones. This is known as the Viterbi algorithm.

We define γt(q), the Viterbi probability, to be the probability of the most probable sequence of states ending in state q at time t and generating observations x1, x2, . . . , xt. Since we wish to be able to reconstruct the sequence of states, we define bpt(q), the "backpointer", to be the state used in this sequence at time t − 1. The base case for the recursion is as follows (the state index of −1 is used as a placeholder since there is no previous best state at time 1):

γ1(q) = πq · Bq(x1)
bp1(q) = −1

The recursion is similar to that of the forward algorithm, except that rather than summing over previous states, the maximum value of all possible trajectories into state r at time t is computed:

γt(r) = max_{q∈S} γt−1(q) · Aq(r) · Br(xt)
bpt(r) = arg max_{q∈S} γt−1(q) · Aq(r) · Br(xt)

Note that the backpointer simply records the index of the originating state; a separate computation is not necessary. To compute the best sequence of states, y∗, the state with the highest probability path at time |x| is selected, and then the backpointers are followed, recursively, to construct the rest of the sequence:

y∗|x| = arg max_{q∈S} γ|x|(q)
y∗t−1 = bpt(y∗t)

Figure 6.5 illustrates a Viterbi trellis for the sequence ⟨John, might, wash⟩, including the backpointers that have been used to compute the most likely state sequence.

Figure 6.5: Computing the most likely state sequence that generated ⟨John, might, wash⟩ under the HMM given in Figure 6.3 using the Viterbi algorithm. The most likely state sequence is highlighted in bold and can be recovered programmatically by following backpointers from the maximal-probability cell in the last column to the first column (thicker arrows).
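A corresponding Python sketch of the Viterbi recursion with backpointers (again illustrative only, not from the original text; the same assumed A, B, and pi structures as in the forward-algorithm sketch above):

```python
def viterbi(x, states, A, B, pi):
    """Return the most likely state sequence for observation sequence x."""
    gamma = [{} for _ in x]   # gamma[t][q]: best score of any path ending in q at time t
    bp = [{} for _ in x]      # bp[t][q]: previous state on that best path
    for q in states:
        gamma[0][q] = pi[q] * B[q].get(x[0], 0.0)
        bp[0][q] = None       # placeholder: no previous best state at time 1
    for t in range(1, len(x)):
        for r in states:
            best_q = max(states, key=lambda q: gamma[t - 1][q] * A[q][r])
            gamma[t][r] = gamma[t - 1][best_q] * A[best_q][r] * B[r].get(x[t], 0.0)
            bp[t][r] = best_q
    # select the best final state, then follow backpointers to the first column
    y = [max(states, key=lambda q: gamma[-1][q])]
    for t in range(len(x) - 1, 0, -1):
        y.append(bp[t][y[-1]])
    return list(reversed(y))
```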


Figure 6.6: A “fully observable” HMM training instance. The output sequence is at the top of the figure, and the corresponding states and transitions are shown in the trellis below.

6.2.4 PARAMETER ESTIMATION FOR HMMS
We now turn to Question 3: given a set of states S and observation vocabulary O, what are the parameters θ∗ = ⟨A, B, π⟩ that maximize the likelihood of a set of training examples, x1, x2, . . . , xℓ?9 Since our model is constructed in terms of variables whose values we cannot observe (the state sequence) in the training data, we may train it to optimize the marginal likelihood (summing over all state sequences) of x using EM. Deriving the EM update equations requires only the application of the techniques presented earlier in this chapter and some differential calculus. However, since the formalism is cumbersome, we will skip a detailed derivation, but readers interested in more information can find it in the relevant citations [78, 125]. In order to make the update equations as intuitive as possible, consider a fully observable HMM, that is, one where both the emissions and the state sequence are observable in all training instances. In this case, a training instance can be depicted as shown in Figure 6.6. When this is the case, such as when we have a corpus of sentences in which all words have already been tagged with their parts of speech, the maximum likelihood estimate for the parameters can be computed in terms of the counts of the number of times the process transitions from state q to state r in all training instances, T(q → r); the number of times that state q emits symbol o, O(q ↑ o); and the number of times the process starts in state q, I(q). In this example, the process starts in state nn; there is one nn → v transition and one v → v transition. The nn state emits John in the first time step, and the v state emits might and wash in the second and third time steps, respectively. We also define N(q) to be the number of times the process enters state q. The maximum likelihood estimates of the parameters in the fully observable case are:
9 Since an HMM models sequences, its training data consists of a collection of example sequences.


πq = I(q) / Σ_{r} I(r)
Aq(r) = T(q → r) / N(q) = T(q → r) / Σ_{r′} T(q → r′)
Bq(o) = O(q ↑ o) / N(q) = O(q ↑ o) / Σ_{o′} O(q ↑ o′)        (6.2)

For example, to compute the emission parameters from state nn, we simply need to keep track of the number of times the process is in state nn and what symbol it generates at each of these times. Transition probabilities are computed similarly: to compute, for example, the distribution Adet(·), that is, the probabilities of transitioning away from state det, we count the number of times the process is in state det, and keep track of what state the process transitioned into at the next time step. This counting and normalizing can be accomplished using the exact same counting and relative frequency algorithms that we described in Section 3.3. Thus, in the fully observable case, parameter estimation is not a new algorithm at all, but one we have seen before.

How should the model parameters be estimated when the state sequence is not provided? It turns out that the update equations have the satisfying form where the optimal parameter values for iteration i + 1 are expressed in terms of the expectations of the counts referenced in the fully observed case, according to the posterior distribution over the latent variables given the observations x and the parameters θ(i):

πq = E[I(q)] / Σ_{r} E[I(r)]
Aq(r) = E[T(q → r)] / E[N(q)]
Bq(o) = E[O(q ↑ o)] / E[N(q)]        (6.3)
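As a concrete illustration of the fully observable (supervised) case of Equation 6.2, the following Python sketch (not part of the original text) accumulates the counts I, T, and O from a small collection of tagged sequences and normalizes them into parameter estimates; the (word, state) input format is an assumption made only for illustration.

```python
from collections import defaultdict

def mle_estimate(tagged_sequences):
    """Estimate pi, A, B from fully observed sequences of (word, state) pairs."""
    I = defaultdict(float)                       # start counts I(q)
    T = defaultdict(lambda: defaultdict(float))  # transition counts T(q -> r)
    O = defaultdict(lambda: defaultdict(float))  # emission counts O(q ^ o)
    for seq in tagged_sequences:
        I[seq[0][1]] += 1
        for word, state in seq:
            O[state][word] += 1
        for (_, q), (_, r) in zip(seq, seq[1:]):
            T[q][r] += 1
    # normalize counts into probability distributions (Equation 6.2)
    pi = {q: c / sum(I.values()) for q, c in I.items()}
    A = {q: {r: c / sum(T[q].values()) for r, c in T[q].items()} for q in T}
    B = {q: {o: c / sum(O[q].values()) for o, c in O[q].items()} for q in O}
    return pi, A, B

# e.g., the single training instance of Figure 6.6:
# mle_estimate([[("John", "nn"), ("might", "v"), ("wash", "v")]])
```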

Because of the independence assumptions made in the HMM, the update equations consist of 2 · |S| + 1 independent optimization problems, just as was the case with the 'observable' HMM. Solving for the initial state distribution, π, is one problem; there are |S| problems solving for the transition distributions Aq(·) from each state q; and |S| problems solving for the emission distributions Bq(·) from each state q. Furthermore, we note that the following must hold:
E[N(q)] = Σ_{r∈S} E[T(q → r)] = Σ_{o∈O} E[O(q ↑ o)]

As a result, the optimization problems (i.e., Equations 6.2) require completely independent sets of statistics, which we will utilize later to facilitate efficient parallelization in MapReduce. How can the expectations in Equation 6.3 be understood? In the fully observed training case, between every time step, there is exactly one transition taken and the source and destination states are observable. By progressing through the Markov chain, we can let each transition count as ‘1’, and we can accumulate the total number of times each kind of transition was taken (by each kind, we simply mean the number of times that one state follows another, for example, the number of times nn follows det). These statistics can then in turn be used to compute the MLE for an ‘observable’ HMM, as


described above. However, when the transition sequence is not observable (as is most often the case), we can instead imagine that at each time step, every possible transition (there are |S|2 of them, and typically |S| is quite small) is taken, with a particular probability. The probability used is the posterior probability of the transition, given the model and an observation sequence (we describe how to compute this value below). By summing over all the time steps in the training data, and using this probability as the ‘count’ (rather than ‘1’ as in the observable case), we compute the expected count of the number of times a particular transition was taken, given the training sequence. Furthermore, since the training instances are statistically independent, the value of the expectations can be computed by processing each training instance independently and summing the results. Similarly for the necessary emission counts (the number of times each symbol in O was generated by each state in S), we assume that any state could have generated the observation. We must therefore compute the probability of being in every state at each time point, which is then the size of the emission ‘count’. By summing over all time steps we compute the expected count of the number of times that a particular state generated a particular symbol. These two sets of expectations, which are written formally here, are sufficient to execute the M-step.
E[O(q ↑ o)] = Σ_{i=1..|x|} Pr(yi = q | x; θ) · δ(xi, o)        (6.4)
E[T(q → r)] = Σ_{i=1..|x|−1} Pr(yi = q, yi+1 = r | x; θ)        (6.5)

Posterior probabilities. The expectations necessary for computing the M-step in HMM training are sums of probabilities that a particular transition is taken, given an observation sequence, and that some state emits some observation symbol, given an observation sequence. These are referred to as posterior probabilities, indicating that they are the probability of some event whose distribution we have a prior belief about, after additional evidence has been taken into consideration (here, the model parameters characterize our prior beliefs, and the observation sequence is the evidence). Both posterior probabilities can be computed by combining the forward probabilities, αt(·), which give the probability of reaching some state at time t, by any path, and generating the observations x1, x2, . . . , xt, with backward probabilities, βt(·), which give the probability of starting in some state at time t and generating the rest of the sequence xt+1, xt+2, . . . , x|x|, using any sequence of states to do so. The algorithm for computing the backward probabilities is given a bit later. Once the forward and backward probabilities have been computed, the state transition posterior probabilities and the emission posterior probabilities can be written as follows:

Pr(yi = q | x; θ) = αi(q) · βi(q)        (6.6)
Pr(yi = q, yi+1 = r | x; θ) = αi(q) · Aq(r) · Br(xi+1) · βi+1(r)        (6.7)

Equation 6.6 is the probability of being in state q at time i, given x, and the correctness of the expression should be clear from the definitions of forward and backward probabilities. The intuition for Equation 6.7, the probability of taking a particular transition at a particular time, is also not complicated: it is the product of four conditionally independent probabilities: the probability of getting to state q at time i (having generated the first part of the sequence), the probability of taking transition q → r (which is specified in the parameters, θ), the probability of generating observation xi+1 from state r (also specified in θ), and the probability of generating the rest of the sequence, along any path. A visualization of the quantities used in computing this probability is shown in Figure 6.7. In this illustration, we assume an HMM with S = {s1, s2, s3} and O = {a, b, c}.

Figure 6.7: Using forward and backward probabilities to compute the posterior probability of the dashed transition, given the observation sequence a b b c b. The shaded area on the left corresponds to the forward probability α2(s2), and the shaded area on the right corresponds to the backward probability β3(s2).

The backward algorithm. Like the forward and Viterbi algorithms introduced above to answer Questions 1 and 2, the backward algorithm uses dynamic programming to incrementally compute βt(·). Its base case starts at time |x|, and is defined as follows:

β|x|(q) = 1

To understand the intuition for this base case, keep in mind that since the backward probabilities βt(·) are the probability of generating the remainder of the sequence after time t (as well as being in some state), and since there is nothing left to generate after time |x|, the probability must be 1. The recursion is defined as follows:

βt(q) = Σ_{r∈S} βt+1(r) · Aq(r) · Br(xt+1)

Unlike the forward and Viterbi algorithms, the backward algorithm is computed from right to left and makes no reference to the start probabilities, π.

6.2.5 FORWARD-BACKWARD TRAINING: SUMMARY
In the preceding section, we have shown how to compute all quantities needed to find the parameter settings θ(i+1) using EM training with a hidden Markov model M = ⟨S, O, θ(i)⟩. To recap: each training instance x is processed independently, using the parameter settings of the current iteration, θ(i). For each x in the training data, the forward and backward probabilities are computed using the algorithms given above (for this reason, this training algorithm is often referred to as the forward-backward algorithm). The forward and backward probabilities are in turn used to compute the expected number of times the underlying Markov process enters into each state, the number of times each state generates each output symbol type, and the number of times each state transitions into each other state. These expectations are summed over all training instances, completing the E-step. The M-step involves normalizing the expected counts computed in the E-step using the calculations in Equation 6.3, which yields θ(i+1). The process then repeats from the E-step using the new parameters.
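The following Python sketch (illustrative only, not from the original text) pulls these pieces together for a single training instance: it computes forward and backward probabilities and accumulates the expected counts of Equations 6.4 and 6.5, reusing the forward function sketched earlier and an analogous backward function. Note that, to obtain proper posteriors, the α · β products are divided here by the total sequence probability Pr(x; θ); that normalization is made explicit in this sketch.

```python
def backward(x, states, A, B):
    """Backward probabilities beta[t][q] (base case: beta_|x|(q) = 1)."""
    beta = [{} for _ in x]
    for q in states:
        beta[-1][q] = 1.0
    for t in range(len(x) - 2, -1, -1):
        for q in states:
            beta[t][q] = sum(beta[t + 1][r] * A[q][r] * B[r].get(x[t + 1], 0.0)
                             for r in states)
    return beta

def expected_counts(x, states, A, B, pi):
    """E-step for one training instance: expected initial, transition,
    and emission counts (Equations 6.4 and 6.5)."""
    alpha = forward(x, states, A, B, pi)
    beta = backward(x, states, A, B)
    Z = sum(alpha[-1].values())            # Pr(x; theta)
    I = {q: alpha[0][q] * beta[0][q] / Z for q in states}
    O = {q: {} for q in states}
    T = {q: {r: 0.0 for r in states} for q in states}
    for t in range(len(x)):
        for q in states:                   # emission posteriors, Eq. 6.4
            gamma = alpha[t][q] * beta[t][q] / Z
            O[q][x[t]] = O[q].get(x[t], 0.0) + gamma
    for t in range(len(x) - 1):
        for q in states:                   # transition posteriors, Eq. 6.5
            for r in states:
                T[q][r] += (alpha[t][q] * A[q][r] *
                            B[r].get(x[t + 1], 0.0) * beta[t + 1][r]) / Z
    return I, T, O
```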

Finally, a few practical considerations: HMMs have a non-convex likelihood surface (meaning that it has the equivalent of many hills and valleys in the number of dimensions corresponding to the number of parameters in the model). EM training is only guaranteed to find a local maximum, and the quality of the learned model may vary considerably, depending on the initial parameters that are used. Strategies for optimal selection of initial parameters depend on the phenomena being modeled. The number of iterations required for convergence depends on the quality of the initial parameters and the complexity of the model; for some applications, only a handful of iterations are necessary, whereas for others, hundreds may be required. Additionally, if some parameter is assigned a probability of 0 (either as an initial value or during one of the M-step parameter updates), EM will never change this in future iterations. This can be useful, since it provides a way of constraining the structures of the Markov model; however, one must be aware of this behavior. Another pitfall to avoid when implementing HMMs is arithmetic underflow. HMMs typically define a massive number of sequences, and so the probability of any one of them is often vanishingly small—so small that they often underflow standard floating point representations. A very common solution to this problem is to represent probabilities using their logarithms. Note that expected counts do not typically have this problem and can be represented using normal floating point numbers. See Section 5.4 for additional discussion on working with log probabilities.
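As a small illustration of the log-probability representation just mentioned (a sketch, not from the original text), products of probabilities become sums of logs, and sums of probabilities can be computed stably with the log-sum-exp identity:

```python
import math

def log_add(log_a, log_b):
    """Compute log(a + b) given log a and log b, without leaving log space."""
    if log_a < log_b:
        log_a, log_b = log_b, log_a
    return log_a + math.log1p(math.exp(log_b - log_a))

# A product of many small probabilities underflows...
probs = [1e-5] * 100
print(math.prod(probs))                    # 0.0 (underflow)
# ...but the equivalent sum of log probabilities does not.
print(sum(math.log(p) for p in probs))     # about -1151.3
```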

6.3 EM IN MAPREDUCE
Expectation maximization algorithms fit quite naturally into the MapReduce programming model. Although the model being optimized determines the details of the required computations, MapReduce implementations of EM algorithms share a number of characteristics:

• Each iteration of EM is one MapReduce job.
• A controlling process (i.e., driver program) spawns the MapReduce jobs, keeps track of the number of iterations and convergence criteria.
• Model parameters θ(i), which are static for the duration of the MapReduce job, are loaded by each mapper from HDFS or other data provider (e.g., a distributed key-value store).
• Mappers map over independent training instances, computing partial latent variable posteriors (or summary statistics, such as expected counts).
• Combiners, which sum together the training statistics, are often quite effective at reducing the amount of data that must be written to disk.
• Reducers sum together the required training statistics and solve one or more of the M-step optimization problems.

The degree of parallelization that can be attained depends on the statistical independence assumed in the model and in the derived quantities required to solve the optimization problems in the M-step. Since parameters are estimated from a collection of samples that are assumed to be i.i.d., the E-step can generally be parallelized effectively since every training instance can be processed independently of the others. In the limit, each independent training instance could be processed by a separate mapper!10

10 Although the wisdom of doing this is questionable, given that the startup costs associated with individual map tasks in Hadoop may be considerable.

Reducers, however, must aggregate the statistics necessary to solve the optimization problems as required by the model. The degree to which these may be solved independently depends on the structure of the model, and this constrains the number of reducers that may be used. It is possible that in the worst case, the M-step optimization problem will not decompose into independent subproblems, making it necessary to use a single reducer. Fortunately, many common models (such as HMMs) require solving several independent optimization problems in the M-step. In this situation, a number of reducers may be run in parallel.

6.3.1 HMM TRAINING IN MAPREDUCE
As we would expect, the training of hidden Markov models parallelizes well in MapReduce. The process can be summarized as follows: in each iteration, mappers process training instances, emitting expected event counts computed using the forward-backward algorithm introduced in Section 6.2.4. Reducers aggregate the expected counts, completing the E-step, and then generate parameter estimates for the next iteration using the updates given in Equation 6.3. This parallelization strategy is effective for several reasons. First, the majority of the computational effort in HMM training is the running of the forward and backward algorithms. Since there is no limit on the number of mappers that may be run, the full computational resources of a cluster may be brought to bear to solve this problem. Second, since the M-step of an HMM training iteration with |S| states in the model consists of 2 · |S| + 1 independent optimization problems that require non-overlapping sets of statistics, this may be exploited with as many as 2 · |S| + 1 reducers running in parallel. While the optimization problem is computationally trivial, being able to reduce in parallel helps avoid the data bottleneck that would limit performance if only a single reducer is used.

HMM training mapper. The pseudo-code for the HMM training mapper is given in Figure 6.8. The input consists of key-value pairs with a unique id as the key and a training instance (e.g., a sentence) as the value. For each training instance, 2 · |S| + 1 stripes are emitted with unique keys, and every training instance emits the same set of keys. Each unique key corresponds to one of the independent optimization problems that will be solved in the M-step. The quantities that are required to solve the M-step optimization problem are quite similar to the relative frequency estimation example discussed in Section 3.3; the main difference is that here we aggregate expected counts of events, rather than counts of observed events. As a result of the similarity, we can employ the stripes representation for aggregating sets of related values, as described in Section 3.2. A pairs approach that requires less memory at the cost of slower performance is also feasible. The outputs are:

class Mapper
    method Initialize(integer iteration)
        ⟨S, O⟩ ← ReadModel
        θ ← ⟨A, B, π⟩ ← ReadModelParams(iteration)
    method Map(sample id, sequence x)
        α ← Forward(x, θ)                              ▷ cf. Section 6.2.2
        β ← Backward(x, θ)                             ▷ cf. Section 6.2.4
        I ← new AssociativeArray                       ▷ initial state expectations
        for all q ∈ S do
            I{q} ← α1(q) · β1(q)
        O ← new AssociativeArray of AssociativeArray   ▷ emissions
        for t = 1 to |x| do
            for all q ∈ S do
                O{q}{xt} ← O{q}{xt} + αt(q) · βt(q)
        T ← new AssociativeArray of AssociativeArray   ▷ transitions
        for t = 1 to |x| − 1 do
            for all q ∈ S do
                for all r ∈ S do
                    T{q}{r} ← T{q}{r} + αt(q) · Aq(r) · Br(xt+1) · βt+1(r)
        Emit(string 'initial', stripe I)
        for all q ∈ S do
            Emit(string 'emit from ' + q, stripe O{q})
            Emit(string 'transit from ' + q, stripe T{q})

Figure 6.8: Mapper pseudo-code for training hidden Markov models using EM. The mappers map over training instances (i.e., sequences of observations xi) and generate the expected counts of initial states, emissions, and transitions taken to generate the sequence.

class Combiner
    method Combine(string t, stripes [C1, C2, . . .])
        Cf ← new AssociativeArray
        for all stripe C ∈ stripes [C1, C2, . . .] do
            Sum(Cf, C)
        Emit(string t, stripe Cf)

class Reducer
    method Reduce(string t, stripes [C1, C2, . . .])
        Cf ← new AssociativeArray
        for all stripe C ∈ stripes [C1, C2, . . .] do
            Sum(Cf, C)
        z ← 0
        for all ⟨k, v⟩ ∈ Cf do
            z ← z + v
        Pf ← new AssociativeArray                      ▷ final parameters vector
        for all ⟨k, v⟩ ∈ Cf do
            Pf{k} ← v/z
        Emit(string t, stripe Pf)

Figure 6.9: Combiner and reducer pseudo-code for training hidden Markov models using EM. The HMMs considered in this book are fully parameterized by multinomial distributions, so reducers do not require special logic to handle different types of model parameters (since they are all of the same type).

1. the probabilities that the unobserved Markov process begins in each state q, with a unique key designating that the values are initial state counts;
2. the expected number of times that state q generated each emission symbol o (the set of emission symbols included will be just those found in each training instance x), with a key indicating that the associated value is a set of emission counts from state q; and
3. the expected number of times state q transitions to each state r, with a key indicating that the associated value is a set of transition counts from state q.

Note that they will be spread across 2 · |S| + 1 keys, representing initial state probabilities π, transition probabilities Aq for each state q, and emission probabilities Bq for each state q.

HMM training reducer. The reducer for one iteration of HMM training, shown together with an optional combiner in Figure 6.9, aggregates the count collections associated with each key by summing them. When the values for each key have been completely aggregated, the associative array contains all of the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameter settings for the following iteration are computed simply by computing the relative frequency of each event with respect to its expected count at the current iteration. The new computed parameters are emitted from the reducer and written to HDFS.

6.4 CASE STUDY: WORD ALIGNMENT FOR STATISTICAL MACHINE TRANSLATION
To illustrate the real-world benefits of expectation maximization algorithms using MapReduce, we turn to the problem of word alignment, which is an important task in statistical machine translation that is typically solved using models whose parameters are learned with EM. We begin by giving a brief introduction to statistical machine translation and the phrase-based translation approach; for a more comprehensive introduction, refer to [85, 97].

Fully-automated translation has been studied since the earliest days of electronic computers. After successes with code-breaking during World War II, there was considerable optimism that translation of human languages would be another soluble problem. In the early years, work on translation was dominated by manual attempts to encode linguistic knowledge into computers—another instance of the 'rule-based' approach we described in the introduction to this chapter. These early attempts failed to live up to the admittedly optimistic expectations. Not only was constructing a translation system labor intensive, but translation pairs had to be developed independently, meaning that improvements in a Russian-English translation system could not, for the most part, be leveraged to improve a French-English system. For a number of years, the idea of fully automated translation was viewed with skepticism.

After languishing for a number of years, the field was reinvigorated in the late 1980s when researchers at IBM pioneered the development of statistical machine translation (SMT), which took a data-driven approach to solving the problem of machine translation, attempting to improve both the quality of translation while reducing the cost of developing systems [29]. The core idea of SMT is to equip the computer to learn how to translate, using example translations which are produced for other purposes, and modeling the process as a statistical process with some parameters θ relating strings in a source language (typically denoted as f) to strings in a target language (typically denoted as e):

e∗ = arg max_{e} Pr(e|f, θ)

With the statistical approach, translation systems can be developed cheaply and quickly for any language pair, as long as there is sufficient training data available. Furthermore, improvements in learning algorithms and statistical modeling can yield benefits in many translation pairs at once, rather than being specific to individual language pairs. Thus, SMT, like many other topics we are considering in this book, is an attempt to leverage the vast quantities of textual data that is available to solve problems that would otherwise require considerable manual effort to encode specialized knowledge. Since the advent of statistical approaches to translation, the field has grown tremendously and numerous statistical models of translation have been developed, with many incorporating quite specialized knowledge about the behavior of natural language as biases in their learning algorithms.

6.4.1 STATISTICAL PHRASE-BASED TRANSLATION
One approach to statistical translation that is simple yet powerful is called phrase-based translation [86]. We provide a rough outline of the process since it is representative of most state-of-the-art statistical translation systems, such as the one used inside Google Translate.11 Phrase-based translation works by learning how strings of words, called phrases, translate between languages.12 Example phrase pairs for Spanish-English translation might include ⟨los estudiantes, the students⟩, ⟨los estudiantes, some students⟩, and ⟨soy, i am⟩. From a few hundred thousand sentences of example translations, many millions of such phrase pairs may be automatically learned. The starting point is typically a parallel corpus (also called bitext), which contains pairs of sentences in two languages that are translations of each other.

11 http://translate.google.com
12 Phrases are simply sequences of words; they are not required to correspond to the definition of a phrase in any linguistic theory.

Parallel corpora are frequently generated as the byproduct of an organization's effort to disseminate information in multiple languages, for example, proceedings of the Canadian Parliament in French and English, and text generated by the United Nations in many different languages.

The parallel corpus is then annotated with word alignments, which indicate which words in one language correspond to words in the other. By using these word alignments as a skeleton, phrases can be extracted from the sentence that are likely to preserve the meaning relationships represented by the word alignment. While an explanation of the process is not necessary here, we mention it as a motivation for learning word alignments, which we show below how to compute with EM. After phrase extraction, each phrase pair is associated with a number of scores which, taken together, are used to compute the phrase translation probability, a conditional probability that reflects how likely the source phrase translates into the target phrase. We briefly note that although EM could be utilized to learn the phrase translation probabilities, this is not typically done in practice since the maximum likelihood solution turns out to be quite bad for this problem. The collection of phrase pairs and their scores are referred to as the translation model.

In addition to the translation model, phrase-based translation depends on a language model, which gives the probability of a string in the target language. The translation model attempts to preserve the meaning of the source language during the translation process, while the language model ensures that the output is fluent and grammatical in the target language. The phrase-based translation process is summarized in Figure 6.10.

A language model gives the probability that a string of words w = ⟨w1, w2, . . . , wn⟩, written as w_1^n for short, is a string in the target language. By the chain rule of probability, we get:

Pr(w_1^n) = Pr(w1) Pr(w2|w1) Pr(w3|w_1^2) · · · Pr(wn|w_1^{n−1}) = Π_{k=1..n} Pr(wk|w_1^{k−1})        (6.8)

Due to the extremely large number of parameters involved in estimating such a model directly, it is customary to make the Markov assumption, that the sequence histories only depend on prior local context. That is, we can approximate P(wk|w_1^{k−1}) as follows:

bigrams:  P(wk|w_1^{k−1}) ≈ P(wk|wk−1)        (6.9)
trigrams: P(wk|w_1^{k−1}) ≈ P(wk|wk−1 wk−2)        (6.10)
n-grams:  P(wk|w_1^{k−1}) ≈ P(wk|w_{k−n+1}^{k−1})        (6.11)

Thus, an n-gram language model is equivalent to a (n − 1)th-order Markov model. The probabilities used in computing Pr(w_1^n) based on an n-gram language model are generally estimated from a monolingual corpus of target language text.
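As a small, hedged illustration (not from the original text) of estimating these probabilities from a monolingual corpus, the following Python sketch computes maximum likelihood bigram probabilities by relative frequency; real systems add smoothing, as discussed next. The sentence-boundary markers are an assumption made for the example.

```python
from collections import defaultdict

def bigram_mle(sentences):
    """Estimate P(w_k | w_{k-1}) by relative frequency from tokenized sentences."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]   # assumed boundary markers
        for w in tokens:
            unigram[w] += 1
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram[(w1, w2)] += 1
    return {(w1, w2): c / unigram[w1] for (w1, w2), c in bigram.items()}

# p = bigram_mle([["the", "green", "witch"], ["the", "small", "table"]])
# p[("the", "green")] -> 0.5
```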

Figure 6.10: The standard phrase-based machine translation architecture. The translation model is constructed with phrases extracted from a word-aligned parallel corpus. The language model is estimated from a monolingual corpus. Both serve as input to the decoder, which performs the actual translation.

Figure 6.11: Translation coverage of the sentence Maria no dio una bofetada a la bruja verde by a phrase-based model. The best possible translation path is indicated with a dashed line.

Since only target language text is necessary (without any additional annotation), language modeling has been well served by large-data approaches that take advantage of the vast quantities of text available on the web.

To translate an input sentence f, the phrase-based decoder creates a matrix of all translation possibilities of all substrings in the input string, as an example illustrates in Figure 6.11. A sequence of phrase pairs is selected such that each word in f is translated exactly once.13 Being able to vary the order of the phrases used is necessary since languages may express the same ideas using different word orders. The decoder seeks to find the translation that maximizes the product of the translation probabilities of the phrases used and the language model probability of the resulting string in the target language. Because the phrase translation probabilities are independent of each other and the Markov assumption made in the language model, this may be done efficiently using dynamic programming. For a detailed introduction to phrase-based decoding, we refer the reader to a recent textbook by Koehn [85].

6.4.2 BRIEF DIGRESSION: LANGUAGE MODELING WITH MAPREDUCE
Statistical machine translation provides the context for a brief digression on distributed parameter estimation for language models using MapReduce. We briefly touched upon this work in Chapter 1, and it provides another example illustrating the effectiveness of data-driven approaches in general. Even after making the Markov assumption, training n-gram language models still requires estimating an enormous number of parameters: potentially V^n, where V is the number of words in the vocabulary. For higher-order models (e.g., 5-grams) used in real-world applications, the number of parameters can easily exceed the number of words from which to estimate those parameters. In fact, most n-grams will never be observed in a corpus, no matter how large. To cope with this sparseness, researchers have developed a number of smoothing techniques [102], which all share the basic idea of moving probability mass from observed to unseen events in a principled manner. For many applications, a state-of-the-art approach is known as Kneser-Ney smoothing [35].

In 2007, Brants et al. [25] reported experimental results that answered an interesting question: given the availability of large corpora (i.e., the web), could a simpler smoothing strategy, applied to more text, beat Kneser-Ney in a machine translation task? It should come as no surprise that the answer is yes. Brants et al. introduced a technique known as "stupid backoff" that was exceedingly simple and so naïve that the resulting model didn't even define a valid probability distribution (it assigned arbitrary scores as opposed to probabilities). The simplicity, however, afforded an extremely scalable implementation in MapReduce. With smaller corpora, stupid backoff didn't work as well as Kneser-Ney in generating accurate and fluent translations. However, as the amount of data increased, the gap between stupid backoff and Kneser-Ney narrowed, and eventually disappeared with sufficient data.

13 The phrases may not necessarily be selected in a strict left-to-right order.
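The exact scoring rule used by stupid backoff is not given in the text; a commonly cited formulation is sketched below in Python as an illustration only. An observed n-gram is scored by relative frequency; otherwise the score of the shortened n-gram is used, scaled by a fixed backoff factor alpha (the specific value here is an assumption).

```python
def stupid_backoff(ngram, counts, total_unigrams, alpha=0.4):
    """Score an n-gram (a tuple of words) with stupid backoff.

    counts maps word tuples to raw corpus frequencies. Returns a score,
    not a normalized probability.
    """
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total_unigrams
    context = ngram[:-1]
    if counts.get(ngram, 0) > 0 and counts.get(context, 0) > 0:
        return counts[ngram] / counts[context]
    # back off to the shorter history, scaled by alpha
    return alpha * stupid_backoff(ngram[1:], counts, total_unigrams, alpha)
```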

Furthermore, with stupid backoff it was possible to train a language model on more data than was feasible with Kneser-Ney smoothing. Applying this language model to a machine translation task yielded better results than a (smaller) language model trained with Kneser-Ney smoothing. The role of the language model in statistical machine translation is to select fluent, grammatical translations from a large hypothesis space: the more training data a language model has access to, the better its description of relevant language phenomena and hence its ability to select good translations. Once again, large data triumphs! For more information about estimating language models using MapReduce, we refer the reader to a forthcoming book from Morgan & Claypool [26].

6.4.3 WORD ALIGNMENT
Word alignments, which are necessary for building phrase-based translation models (as well as many other more sophisticated translation models), can be learned automatically using EM. In this section, we introduce a popular alignment model based on HMMs. In the statistical model of word alignment considered here, the observable variables are the words in the source and target sentences (conventionally written using the variables f and e, respectively), and their alignment is a latent variable. To make this model tractable, we assume that words are translated independently of one another, which means that the model's parameters include the probability of any word in the source language translating to any word in the target language. While this independence assumption is problematic in many ways, it results in a simple model structure that admits efficient inference yet produces reasonable alignments. Alignment models that make this assumption generate a string e in the target language by selecting words in the source language according to a lexical translation distribution. The indices of the words in f used to generate each word in e are stored in an alignment variable, a, and |a| = |e|.14 This means that the variable ai indicates the source word position of the ith target word generated. Using these assumptions, the probability of an alignment and translation can be written as follows:

Pr(e, a|f) = Pr(a|f, e) × Π_{i=1..|e|} Pr(ei | fai)
                 [alignment probability]  [lexical probability]

Since we have parallel corpora consisting of only ⟨f, e⟩ pairs, we can learn the parameters for this model using EM and treating a as a latent variable.

14 In the original presentation of statistical lexical translation models, a special null word is added to the source sentences, which permits words to be inserted 'out of nowhere'. Since this does not change any of the important details of training, we omit it from our presentation for simplicity.

However, to combat data sparsity in the alignment probability, it is conventional to further simplify the alignment probability model. By letting the probability of an alignment depend only on the position of the previous aligned word we capture a valuable insight (namely, words that are nearby in the source language will tend to be nearby in the target language), and our model acquires the structure of an HMM [150]:

Pr(e, a|f) = Π_{i=1..|e|} Pr(ai | ai−1) × Pr(ei | fai)
                 [transition probability]   [emission probability]

This model can be trained using the forward-backward algorithm described in the previous section, summing over all settings of a, and the best alignment for a sentence pair can be found using the Viterbi algorithm. To properly initialize this HMM, we must make some further simplifying assumptions. The favored simplification is to assert that all alignments are uniformly probable:

Pr(e, a|f) = 1/|f|^|e| × Π_{i=1..|e|} Pr(ei | fai)

This model is known as IBM Model 1. It is attractive for initialization because it is convex everywhere, and therefore EM will learn the same solution regardless of initialization, and we use this simpler model to learn initial lexical translation (emission) parameters for the HMM. Finally, while the forward-backward algorithm could be used to compute the expected counts necessary for training this model by setting Aq(r) to be a constant value for all q and r, the uniformity assumption means that the expected emission counts can be estimated in time O(|e| · |f|), rather than time O(|e| · |f|²) required by the forward-backward algorithm.
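To make the Model 1 E-step concrete, here is a minimal Python sketch (illustrative only, not from the original text) of the expected lexical translation counts for a single sentence pair under the uniform-alignment assumption; t[e_word][f_word] is assumed to hold the current lexical translation probabilities.

```python
from collections import defaultdict

def model1_expected_counts(f_words, e_words, t):
    """Accumulate expected lexical translation counts for one sentence pair.

    Under Model 1, each target word e is generated by every source word f
    with posterior probability proportional to t[e][f].
    """
    counts = defaultdict(lambda: defaultdict(float))
    for e in e_words:
        z = sum(t[e][f] for f in f_words)       # normalization over source words
        for f in f_words:
            counts[e][f] += t[e][f] / z         # fractional (expected) count
    return counts

# In the M-step, these expected counts, summed over all sentence pairs in the
# corpus, are renormalized to produce the next iteration's t[e][f] estimates.
```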

6.4.4 EXPERIMENTS
How well does a MapReduce word aligner for statistical machine translation perform? We describe previously-published results [54] that compared a Java-based Hadoop implementation against a highly optimized word aligner called Giza++ [112], which was written in C++ and designed to run efficiently on a single core. We compared the training time of Giza++ and our aligner on a Hadoop cluster with 19 slave nodes, each with two single-core processors and two disks (38 cores total).

Figure 6.12 shows the performance of Giza++ in terms of the running time of a single EM iteration for both Model 1 and the HMM alignment model as a function of the number of training pairs. Both axes in the figure are on a log scale, but the ticks on the y-axis are aligned with 'meaningful' time intervals rather than exact orders of magnitude. Three things may be observed in the results. First, the running time scales linearly with the size of the training data. Second, the HMM is a constant factor slower than Model 1. Third, the alignment process is quite slow as the size of the training data grows—at one million sentences, a single iteration takes over three hours to complete! Five iterations are generally necessary to train the models, which means that full training takes the better part of a day.

In Figure 6.13 we plot the running time of our MapReduce implementation running on the 38-core cluster described above. For reference, we plot points indicating what 1/38 of the running time of the Giza++ iterations would be at each data size, assuming that there was no overhead associated with distributing computation across these machines, which gives a rough indication of what an 'ideal' parallelization could achieve. There are three things to note. First, as the amount of data increases, the relative cost of the overhead associated with distributing data, marshaling and aggregating counts, decreases, and the HMM alignment iterations begin to approach optimal runtime efficiency. Second, Model 1, which we observe is light on computation, does not approach the theoretical performance of an ideal parallelization, and in fact, at one million sentence pairs of training data, has almost the same running time as the HMM alignment algorithm. We conclude that the overhead associated with distributing and aggregating data is significant compared to the Model 1 computations, although a comparison with Figure 6.12 indicates that the MapReduce implementation is still substantially faster than the single core implementation. Third, at large data sizes, even an unsophisticated implementation of the HMM aligner could compete favorably with a highly optimized single-core system whose performance is well-known to many people in the MT research community. Finally, we note that, in comparison to the running times of the single-core implementation, these results show that when computation is distributed over a cluster of many machines, there is a significant advantage to using the distributed implementation, at least once a certain training data size is reached. Although these results do confound several variables (Java vs. C++ performance, memory usage patterns), it is reasonable to expect that the confounds would tend to make the single-core system's performance appear relatively better than the MapReduce system (which is, of course, the opposite pattern from what we actually observe).

Why are these results important? Perhaps the most significant reason is that the quantity of parallel data that is available to train statistical machine translation models is ever increasing, and as is the case with so many problems we have encountered, more data leads to improvements in translation quality [54]. Recently a corpus of one billion words of French-English data was mined automatically from the web and released publicly [33].

Figure 6.12: Running times of Giza++ (baseline single-core system) for Model 1 and HMM training iterations at various corpus sizes.

Figure 6.13: Running times of our MapReduce implementation of Model 1 and HMM training iterations at various corpus sizes. For reference, 1/38 running times of the Giza++ models are shown.

Single-core solutions to model construction simply cannot keep pace with the amount of translated data that is constantly being produced. Fortunately, several independent researchers have shown that existing modeling algorithms can be expressed naturally and effectively using MapReduce, which means that we can take advantage of this data. Furthermore, the results presented here show that even at data sizes that may be tractable on single machines, significant performance improvements are attainable using MapReduce implementations. This improvement reduces experimental turnaround times, which allows researchers to more quickly explore the solution space—which will, we hope, lead to rapid new developments in statistical machine translation. For the reader interested in statistical machine translation, there is an open source Hadoop-based MapReduce implementation of a training pipeline for phrase-based translation that includes word alignment, phrase extraction, and phrase scoring [56].15

15 http://www.statmt.org/wmt10/translation-task.html

6.5 EM-LIKE ALGORITHMS
This chapter has focused on expectation maximization algorithms and their implementation in the MapReduce programming framework. These important algorithms are indispensable for learning models with latent structure from unannotated data. We now explore some related learning algorithms that are similar to EM but can be used to solve more general problems, and discuss their implementation. In this section we focus on gradient-based optimization, which refers to a class of techniques used to optimize any objective function, provided it is differentiable with respect to the parameters being optimized. Obviously, these algorithms are only applicable in cases where a useful objective exists, is differentiable, and its derivatives can be efficiently evaluated. Fortunately, this is the case for many important problems of interest in text processing. Gradient-based optimization is particularly useful in the learning of maximum entropy (maxent) models [110] and conditional random fields (CRF) [87] that have an exponential form and are trained to maximize conditional likelihood. In addition to being widely used supervised classification models in text processing (meaning that during training, both the data and their annotations must be observable), their gradients take the form of expectations, and they can be implemented quite naturally in MapReduce. Fortunately, some of the previously-introduced techniques are also applicable for optimizing these models.

6.5.1 GRADIENT-BASED OPTIMIZATION AND LOG-LINEAR MODELS
Gradient-based optimization refers to a class of iterative optimization algorithms that use the derivatives of a function to find the parameters that yield a minimal or maximal value of that function. For the purposes of this discussion, we will give examples in terms of minimizing functions.

Assume that we have some real-valued function F(θ) where θ is a k-dimensional vector and that F is differentiable with respect to θ. Its gradient is defined as:

\nabla F(\theta) = \left\langle \frac{\partial F}{\partial \theta_1}(\theta), \frac{\partial F}{\partial \theta_2}(\theta), \ldots, \frac{\partial F}{\partial \theta_k}(\theta) \right\rangle

The gradient has two crucial properties that are exploited in gradient-based optimization. First, the gradient ∇F is a vector field that points in the direction of the greatest increase of F and whose magnitude indicates the rate of increase. Second, if θ* is a (local) minimum of F, then the following is true:

\nabla F(\theta^*) = 0

An extremely simple gradient-based minimization algorithm produces a series of parameter estimates θ(1), θ(2), . . ., by starting with some initial parameter settings θ(1) and updating parameters through successive iterations according to the following rule:

\theta^{(i+1)} = \theta^{(i)} - \eta^{(i)} \nabla F(\theta^{(i)}) \qquad (6.12)

The parameter η(i) > 0 is a learning rate which indicates how quickly the algorithm moves along the gradient during iteration i. Provided this value is small enough that F decreases, this strategy will find a local minimum of F. However, while simple, this update strategy may converge slowly, and proper selection of η is non-trivial. More sophisticated algorithms perform updates that are informed by approximations of the second derivative, which are estimated by successive evaluations of ∇F(θ), and can converge much more rapidly [96].

Gradient-based optimization in MapReduce. Gradient-based optimization algorithms can often be implemented effectively in MapReduce. Like EM, where the structure of the model determines the specifics of the realization, the details of the function being optimized determine how it should best be implemented, and not every function optimization problem will be a good fit for MapReduce. Nevertheless, MapReduce implementations of gradient-based optimization tend to have the following characteristics (a small illustrative sketch follows the list):

• Each optimization iteration is one MapReduce job.

• The objective should decompose linearly across training instances. This implies that the gradient also decomposes linearly, and therefore mappers can process input data in parallel. The values they emit are pairs ⟨F(θ), ∇F(θ)⟩, which are linear components of the objective and gradient.

• Evaluation of the function and its gradient is often computationally expensive because they require processing lots of data. This makes parallelization with MapReduce worthwhile.

• Reducer(s) sum the component objective/gradient pairs, compute the total objective and gradient, run the optimization algorithm, and emit θ(i+1).

• Whether more than one reducer can run in parallel depends on the specific optimization algorithm being used. Some, like the trivial algorithm of Equation 6.12, treat the dimensions of θ independently, whereas many are sensitive to global properties of ∇F(θ). In the latter case, parallelization across multiple reducers is non-trivial.

• Many optimization algorithms are stateful and must persist their state between optimization iterations. This may either be emitted together with θ(i+1) or written to the distributed file system as a side effect of the reducer. Such external side effects must be handled carefully; refer to Section 2.2 for a discussion.
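The sketch below (Python, invented for this discussion rather than taken from any existing implementation) simulates the division of labor described in the list above for the simple update rule of Equation 6.12: a "mapper" computes a per-instance objective and gradient for a toy squared-error model, and a "reducer" sums these linear components and applies the update. In a real Hadoop job, θ would be distributed to the mappers (for example, as side data loaded at task setup) and the updated parameters written back to the distributed file system; here everything runs in a single process for clarity.

```python
# Illustrative sketch only: one iteration of the update rule in Equation 6.12,
# organized the way a MapReduce job would organize it. The per-instance loss is
# a toy squared error; all names are invented for this example.

def map_instance(theta, x, y):
    """Mapper logic: emit the per-instance objective and gradient.
    Toy model: prediction = theta . x, loss = 0.5 * (prediction - y)^2."""
    pred = sum(t * xi for t, xi in zip(theta, x))
    err = pred - y
    f_i = 0.5 * err * err
    grad_i = [err * xi for xi in x]          # d(loss)/d(theta_j)
    return f_i, grad_i

def reduce_updates(theta, partials, eta):
    """Reducer logic: sum the linear components, then apply Equation 6.12."""
    total_f = sum(f for f, _ in partials)
    total_grad = [sum(g[j] for _, g in partials) for j in range(len(theta))]
    new_theta = [t - eta * g for t, g in zip(theta, total_grad)]
    return total_f, new_theta

# Each pass over this tiny dataset stands in for one MapReduce job.
data = [([1.0, 2.0], 5.0), ([2.0, 0.5], 3.0), ([0.0, 1.0], 1.5)]
theta = [0.0, 0.0]
for _ in range(20):
    partials = [map_instance(theta, x, y) for x, y in data]        # map phase
    objective, theta = reduce_updates(theta, partials, eta=0.1)    # reduce phase
print(objective, theta)
```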

Parameter learning for log-linear models. Gradient-based optimization techniques can be quite effectively used to learn the parameters of probabilistic models with a log-linear parameterization [100]. While a comprehensive introduction to these models is beyond the scope of this book, such models are used extensively in text processing applications, and their training using gradient-based optimization, which may otherwise be computationally expensive, can be implemented effectively using MapReduce. We therefore include a brief summary.

Log-linear models are particularly useful for supervised learning (unlike the unsupervised models learned with EM), where an annotation y ∈ Y is available for every x ∈ X in the training data. In this case, it is possible to directly model the conditional distribution of label given input:

\Pr(y \mid x; \theta) = \frac{\exp \sum_i \theta_i \cdot H_i(x, y)}{\sum_{y'} \exp \sum_i \theta_i \cdot H_i(x, y')} \qquad (6.13)

In this expression, the Hi are real-valued functions sensitive to features of the input and labeling. The parameters of the model are selected so as to minimize the negative conditional log likelihood of a set of training instances ⟨x1, y1⟩, ⟨x2, y2⟩, . . ., which we assume to be i.i.d.:

F(\theta) = \sum_{\langle x, y \rangle} -\log \Pr(y \mid x, \theta) \qquad \theta^* = \arg\min_{\theta} F(\theta) \qquad (6.14)
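As a concrete (and entirely invented) illustration of Equations 6.13 and 6.14, the following Python sketch evaluates the conditional probability of a label by enumerating a tiny label set and sums the negative conditional log likelihood over a toy training set. The feature functions, labels, and data are made up for this example; realistic models have far larger feature and label spaces, which is what motivates the dynamic programming techniques discussed below.

```python
import math

# Illustrative sketch of Equations 6.13 and 6.14 for a tiny multiclass
# log-linear model, with a label set small enough to enumerate directly.

LABELS = ["pos", "neg"]

def features(x, y):
    """H(x, y): toy indicator-style features over words paired with the label."""
    return {(word, y): 1.0 for word in x}

def score(theta, x, y):
    return sum(theta.get(k, 0.0) * v for k, v in features(x, y).items())

def conditional_prob(theta, x, y):
    """Equation 6.13: exp(score) normalized over all labels."""
    z = sum(math.exp(score(theta, x, yp)) for yp in LABELS)
    return math.exp(score(theta, x, y)) / z

def negative_log_likelihood(theta, data):
    """Equation 6.14: F(theta) summed over i.i.d. training pairs <x, y>."""
    return sum(-math.log(conditional_prob(theta, x, y)) for x, y in data)

train = [(["good", "fun"], "pos"), (["bad", "boring"], "neg")]
theta = {("good", "pos"): 0.5, ("bad", "neg"): 0.5}
print(negative_log_likelihood(theta, train))
```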

As Equation 6.13 makes clear, the objective decomposes linearly across training instances, meaning it can be optimized quite well in MapReduce. The derivative of F with respect to θi can be shown to have the following form [141]:16

\frac{\partial F}{\partial \theta_i}(\theta) = \sum_{\langle x, y \rangle} \Big( H_i(x, y) - \mathrm{E}_{\Pr(y' \mid x, \theta)}[H_i(x, y')] \Big)

The expectation in the second part of the gradient's expression can be computed using a variety of techniques. However, when very large event spaces are being modeled, as is the case with sequence labeling, enumerating all possible values y can become computationally intractable. As we saw with EM, independence assumptions can be used to enable efficient computation using dynamic programming. In fact, the forward-backward algorithm introduced in Section 6.4 can, with only minimal modification, be used to compute the expectation needed in CRF sequence models, as long as the feature functions respect the same Markov assumption that is made in HMMs. For more information about inference in CRFs using the forward-backward algorithm, we refer the reader to Sha and Pereira [140]. As we saw in the previous section, MapReduce offers significant speedups when training iterations require running the forward-backward algorithm; as was the case with HMMs, the same pattern of results holds when training linear CRFs.

16 This assumes that when ⟨x, y⟩ is present the model is fully observed (i.e., there are no additional latent variables).

6.6 SUMMARY AND ADDITIONAL READINGS

This chapter focused on learning the parameters of statistical models from data, using expectation maximization algorithms or gradient-based optimization techniques. We focused especially on EM algorithms for three reasons. First, these algorithms can be expressed naturally in the MapReduce programming model, making them a good example of how to express a commonly-used algorithm in this new framework. Second, many models, such as the widely-used hidden Markov model (HMM) trained using EM, make independence assumptions that permit a high degree of parallelism in both the E- and M-steps. Thus, they are particularly well-positioned to take advantage of large clusters. Finally, EM algorithms are unsupervised learning algorithms, which means that they have access to far more training data than comparable supervised approaches, given the enormous quantities of data available "for free". This is quite important: in Chapter 1, when we hailed large data as the "rising tide that lifts all boats" to yield more effective algorithms, we were mostly referring to unsupervised approaches. Data acquisition for unsupervised algorithms is often as simple as crawling specific web sources, given that the manual effort required to generate annotated data remains a bottleneck in many supervised approaches.

This, combined with the ability of MapReduce to process large datasets in parallel, provides researchers with an effective strategy for developing increasingly-effective applications. The discussion demonstrates that not only does MapReduce provide a means for coping with ever-increasing amounts of data, but it is also useful for parallelizing expensive computations. Since EM algorithms are relatively computationally expensive, even for small amounts of data, the ability to leverage clusters of commodity hardware to parallelize computationally-expensive algorithms is an important use case, although MapReduce has been designed with mostly data-intensive applications in mind. This led us to consider how related supervised learning models (which typically have much less training data available) can also be implemented in MapReduce. Because of its ability to leverage large amounts of training data, machine learning is an attractive problem for MapReduce and an area of active research.

Additional Readings. Chu et al. [37] presented general formulations of a variety of machine learning problems, focusing on a normal form for expressing a variety of machine learning algorithms in MapReduce. Issues associated with a MapReduce implementation of latent Dirichlet allocation (LDA), which is another important unsupervised learning technique with certain similarities to EM, have been explored by Wang et al. [151]. The Apache Mahout project is an open-source implementation of these and other learning algorithms,17 and it is also the subject of a forthcoming book [116].

17 http://lucene.apache.org/mahout/

CHAPTER 7

Closing Remarks

The need to process enormous quantities of data has never been greater. Not only are terabyte- and petabyte-scale datasets rapidly becoming commonplace, but there is consensus that great value lies buried in them, waiting to be unlocked by the right computational tools. In the commercial sphere, business intelligence—driven by the ability to gather data from a dizzying array of sources—promises to help organizations better understand their customers and the marketplace, hopefully leading to better business decisions and competitive advantages. In the natural and physical sciences, the ability to analyze massive amounts of data may provide the key to unlocking the secrets of the cosmos or the mysteries of life. For engineers building information processing tools and applications, larger datasets lead to more effective algorithms for a wide range of tasks, from machine translation to spam detection.

In the preceding chapters, we have shown how MapReduce can be exploited to solve a variety of problems related to text processing at scales that would have been unthinkable a few years ago. However, no tool—no matter how powerful or flexible—can be perfectly adapted to every task, so it is only fair to discuss the limitations of the MapReduce programming model and survey alternatives. Section 7.1 covers online learning algorithms and Monte Carlo simulations, which are examples of algorithms that require maintaining global state. As we have seen, this is difficult to accomplish in MapReduce. Section 7.2 discusses alternative programming models, and the book concludes in Section 7.3.

7.1 LIMITATIONS OF MAPREDUCE

As we have seen throughout this book, solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global synchronization in MapReduce is the barrier between the map and reduce phases of processing).

The first example is online learning. Recall from Chapter 6 the concept of learning as the setting of parameters in a statistical model. Both EM and the gradient-based learning algorithms we described are instances of what are known as batch learning algorithms. This simply means that the full "batch" of training data is processed before any updates to the model parameters are made. On one hand, this is quite reasonable: updates are not made until the full evidence of the training data has been weighed against the model.

An earlier update would seem, in some sense, to be hasty. However, it is generally the case that more frequent updates can lead to more rapid convergence of the model (in terms of number of training instances processed), even if those updates are made by considering less data [24]. Thinking in terms of gradient optimization (see Section 6.5), online learning algorithms can be understood as computing an approximation of the true gradient, using only a few training instances. Although only an approximation, the gradient computed from a small subset of training instances is often quite reasonable, and the aggregate behavior of multiple updates tends to even out errors that are made. With online learning, these updates must occur after processing smaller numbers of instances; in the limit, updates can be made after every training instance.

Unfortunately, implementing online learning algorithms in MapReduce is problematic. The model parameters in a learning algorithm can be viewed as shared global state, which must be updated as the model is evaluated against training data. All processes performing the evaluation (presumably the mappers) must have access to this state. In a batch learner, where updates occur in one or more reducers (or, alternatively, in the driver code), synchronization of this resource is enforced by the MapReduce framework. With online learning, however, the framework must be altered to support faster processing of smaller datasets, which goes against the design choices of most existing MapReduce implementations. Since MapReduce was specifically optimized for batch operations over large amounts of data, such a style of computation would likely result in inefficient use of resources. In Hadoop, for example, map and reduce tasks have considerable startup costs. This is acceptable because in most circumstances, this cost is amortized over the processing of many key-value pairs. However, for small datasets, these high startup costs become intolerable. An alternative is to abandon shared global state and run independent instances of the training algorithm in parallel (on different portions of the data). A final solution is then arrived at by merging individual results. Experiments, however, show that the merged solution is inferior to the output of running the training algorithm on the entire dataset [52].
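To make the contrast between batch and online updates concrete, here is a small Python sketch (a toy invented for this discussion, not code from any of the systems cited) that applies one update per full pass over the data in the batch case and one update per training instance in the online case, for the same squared-error objective. The batch variant maps naturally onto one MapReduce job per pass; the online variant would need to update shared parameters after every instance, which is precisely the global state that the programming model does not provide.

```python
import random

# Illustrative sketch: batch vs. online updates for the same toy squared-error
# objective. The batch learner applies one update per full pass over the data;
# the online learner updates after every instance, approximating the true
# gradient from a single example.

def grad(theta, x, y):
    err = sum(t * xi for t, xi in zip(theta, x)) - y
    return [err * xi for xi in x]

def batch_pass(theta, data, eta):
    total = [0.0] * len(theta)
    for x, y in data:
        total = [t + gi for t, gi in zip(total, grad(theta, x, y))]
    return [t - eta * g for t, g in zip(theta, total)]       # one update per pass

def online_pass(theta, data, eta):
    for x, y in data:
        g = grad(theta, x, y)
        theta = [t - eta * gi for t, gi in zip(theta, g)]     # update per instance
    return theta

data = [([1.0, 2.0], 5.0), ([2.0, 0.5], 3.0), ([0.0, 1.0], 1.5)]
theta_batch, theta_online = [0.0, 0.0], [0.0, 0.0]
for _ in range(5):
    theta_batch = batch_pass(theta_batch, data, eta=0.05)
    random.shuffle(data)                                      # common online practice
    theta_online = online_pass(theta_online, data, eta=0.05)
print(theta_batch, theta_online)
```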

A related difficulty occurs when running what are called Monte Carlo simulations, which are used to perform inference in probabilistic models where evaluating or representing the model exactly is impossible. The basic idea is quite simple: samples are drawn from the random variables in the model to simulate its behavior, and then simple frequency statistics are computed over the samples. This sort of inference is particularly useful when dealing with so-called nonparametric models, which are models whose structure is not specified in advance, but is rather inferred from training data. For an illustration, imagine learning a hidden Markov model, but inferring the number of states, rather than having them specified. Being able to parallelize Monte Carlo simulations would be tremendously valuable, particularly for unsupervised learning applications where they have been found to be far more effective than EM-based learning (which requires specifying the model). Although recent work [10] has shown that the delays in synchronizing sample statistics due to parallel implementations do not necessarily damage the inference, MapReduce offers no natural mechanism for managing the global shared state that would be required for such an implementation.

The problem of global state is sufficiently pervasive that there has been substantial work on solutions. One approach is to build a distributed datastore capable of maintaining the global state. However, such a system would need to be highly scalable to be used in conjunction with MapReduce. Google's BigTable [34], which is a sparse, distributed, persistent multidimensional sorted map built on top of GFS, fits the bill, and has been used in exactly this manner. Amazon's Dynamo [48], which is a distributed key-value store (with a very different architecture), might also be useful in this respect, although it wasn't originally designed with such an application in mind. Unfortunately, it is unclear if the open-source implementations of these two systems (HBase and Cassandra, respectively) are sufficiently mature to handle the low-latency and high-throughput demands of maintaining global state in the context of massively distributed processing (but recent benchmarks are encouraging [40]).

7.2 ALTERNATIVE COMPUTING PARADIGMS

Streaming algorithms [3] represent an alternative programming model for dealing with large volumes of data with limited computational and storage resources. This model assumes that data are presented to the algorithm as one or more streams of inputs that are processed in order. The model is agnostic with respect to the source of these streams, which could be files in a distributed file system, data from an "external" source, or some other data gathering device. Stream processing is very attractive for working with time-series data (news feeds, tweets, sensor readings, etc.), which is difficult in MapReduce (once again, given its batch-oriented design). Furthermore, since streaming algorithms are comparatively simple (because there is only so much that can be done with a particular training instance), they can often take advantage of modern GPUs, which have a large number of (relatively simple) functional units [104]. In the context of text processing, streaming algorithms have been applied to language modeling [90], translation modeling [89], and detecting the first mention of a news event in a stream [121].
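As a generic illustration of the streaming model's constraints, and not an algorithm drawn from the works cited above, the following Python sketch uses reservoir sampling to maintain a uniform random sample of a stream in a single pass with constant memory; the stream here is a synthetic generator, but it could just as well be a feed of documents, tweets, or sensor readings.

```python
import random

# Generic illustration of the streaming model: reservoir sampling maintains a
# uniform random sample of k items from a stream of unknown length, touching
# each item exactly once and using only O(k) memory.

def reservoir_sample(stream, k, rng=random):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)          # uniform over positions seen so far
            if j < k:
                reservoir[j] = item        # keep item with probability k/(i+1)
    return reservoir

# The "stream" is a generator, emphasizing that items arrive in order and are
# seen only once.
stream = (f"token-{n}" for n in range(1_000_000))
print(reservoir_sample(stream, k=5))
```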

The idea of stream processing has been generalized in the Dryad framework as arbitrary dataflow graphs [75, 159]. A Dryad job is a directed acyclic graph where each vertex represents developer-specified computations and edges represent data channels that capture dependencies. The dataflow graph is a logical computation graph that is automatically mapped onto physical resources by the framework. At runtime, channels are used to transport partial results between vertices, and can be realized using files, TCP pipes, or shared memory.

Another system worth mentioning is Pregel [98], which implements a programming model inspired by Valiant's Bulk Synchronous Parallel (BSP) model [148]. Pregel was specifically designed for large-scale graph algorithms, but unfortunately there are few published details at present. However, a longer description is anticipated in a forthcoming paper [99].

What is the significance of these developments? The power of MapReduce derives from providing an abstraction that allows developers to harness the power of large clusters. As anyone who has taken an introductory computer science course would know, abstractions manage complexity by hiding details and presenting well-defined behaviors to users of those abstractions. This process makes certain tasks easier, but others more difficult, if not impossible. MapReduce is certainly no exception to this generalization, and one of the goals of this book has been to give the reader a better understanding of what's easy to do in MapReduce and what its limitations are. However, this begs the obvious question: What other abstractions are available in the massively-distributed datacenter environment? Are there more appropriate computational models that would allow us to tackle classes of problems that are difficult for MapReduce?

Dryad and Pregel are alternative answers to these questions. They share in providing an abstraction for large-scale distributed computations, separating the what from the how of computation and isolating the developer from the details of concurrent programming. They differ, however, in how distributed computations are conceptualized: functional-style programming, arbitrary dataflows, or BSP. These conceptions represent different tradeoffs between simplicity and expressivity: for example, Dryad is more flexible than MapReduce, and in fact, MapReduce can be trivially implemented in Dryad. However, it remains unclear, at least at present, which approach is more appropriate for different classes of applications.

Even within the Hadoop/MapReduce ecosystem, we have already observed the development of alternative approaches for expressing distributed computations. For example, there is a proposal to add a third merge phase after map and reduce to better support relational operations [36]. Pig [114], which was inspired by Google's Sawzall [122], can be described as a data analytics platform that provides a lightweight scripting language for manipulating large datasets. Although Pig scripts (in a language called Pig Latin) are ultimately converted into Hadoop jobs by Pig's execution engine, constructs in the language allow developers to specify data transformations (filtering, grouping, joining, etc.) at a much higher level. Similarly, Hive [68], another open-source project, provides an abstraction on top of Hadoop that allows users to issue SQL queries against large relational datasets stored in HDFS. Hive queries (in HiveQL) "compile down" to Hadoop jobs by the Hive query engine. Therefore, the system provides a data analysis tool for users who are already comfortable with relational databases, while simultaneously taking advantage of Hadoop's data processing capabilities. But of course, MapReduce is not the end, and perhaps not even the best; it is merely the first of many approaches to harness large-scaled distributed computing resources. Looking forward, we can certainly expect the development of new models and a better understanding of existing ones.

7.3 MAPREDUCE AND BEYOND

The capabilities necessary to tackle large-data problems are already within reach by many and will continue to become more accessible over time. By scaling "out" with commodity servers, we have been able to economically bring large clusters of machines to bear on problems of interest. But this has only been possible with corresponding innovations in software and how computations are organized on a massive scale. Important ideas include: moving processing to the data, as opposed to the other way around; emphasizing throughput over latency for batch tasks by sequential scans through data, avoiding random seeks; fault tolerance and error recovery; and a host of other issues in distributed computing. Most important of all, however, is the development of new abstractions that hide system-level details from the application developer. These abstractions are at the level of entire datacenters, and provide a model using which programmers can reason about computations at a massive scale without being distracted by fine-grained concurrency management.

None of these points are new or particularly earth-shattering—computer scientists have known about these principles for decades. However, MapReduce is unique in that, for the first time, all these ideas came together and were demonstrated on practical problems at scales unseen before, both in terms of computational resources and the impact on the daily lives of millions. The engineers at Google deserve a tremendous amount of credit for that, and also for sharing their insights with the rest of the world. Furthermore, the engineers and executives at Yahoo deserve a lot of credit for starting the open-source Hadoop project, which has made MapReduce accessible to everyone and created the vibrant software ecosystem that flourishes today. Add to that the advent of utility computing, which eliminates capital investments associated with cluster infrastructure, and large-data processing capabilities are now available "to the masses" with a relatively low barrier to entry. This, in turn, paves the way for innovations in scalable algorithms that can run on petabyte-scale datasets. The golden age of massively distributed computing is finally upon us.

Katz. [9] Michael Armbrust. Neil Conway. Sears. Khaled Elmeleegy. Jeanna Neefe. 2009. BOOM: Data-centric programming in the datacenter. Lyon. Joseph M. Michael Dahlin. Philadelphia. pages 483–485. and Russell C. and Randolph Wang. Pennsylvania. Information Retrieval. San Diego. Anthony D. Reviews of Modern Physics. 2002. Yossi Matias. Patterson. Joseph. Hellerstein. pages 109–126. 2005. pages 922–933. and Alexander Rasin. Daniel Abadi. Serverless network file systems. 8(1):151–166. Copper Mountain Resort. France. [2] R´ka Albert and Albert-L´szl´ Barab´si. Technical Report UCB/EECS-2009-98. California. In Proceedings of the 35th International Conference on Very Large Data Base (VLDB 2009). [8] Vo Ngoc Anh and Alistair Moffat. Drew Roselli. Tyson Condie. Rean Griffith. 1995. David Patterson. Statistical mechanics of complex nete a o a works. 1996. 74:47–97. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP 1995). Cloud analytics: Do we really need to reinvent the storage stack? In Proceedings of the 2009 Workshop on Hot Topics in Cloud Computing (HotCloud 09). Prasenjit Sarkar. 2009. Avi Silberschatz. Kamil Bajda-Pawlikowski. Andrew Konwinski. Armando Fox. Inverted index compression using word-aligned binary codes. [3] Noga Alon. Gunho Lee. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing (STOC ’96). The space complexity of approximating the frequency moments. 2009. Colorado. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Prashant Pandey. Ariel Rabkin. David A. and Renu Tewari. Mansi Shah. Electrical Engineering and Computer Sciences. Randy H. University of California at Berkeley. Karan Gupta. [5] Gene Amdahl. 1967. Ion .157 Bibliography [1] Azza Abouzeid. [4] Peter Alvaro. and Mario Szegedy. Validity of the single processor approach to achieving large-scale computing capabilities. [6] Rajagopal Ananthanarayanan. In Proceedings of the AFIPS Spring Joint Computer Conference. [7] Thomas Anderson. Himabindu Pucha. pages 20–29.

Alex Ho. Electrical Engineering and Computer Sciences. Challenges on distributed web retrieval. and Fabrizio Silvestri. Rolf Neugebauer. and Fabrizio Silvestri. Scaling to very very large corpora for natural language disambiguation. 2009. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001). Turkey. and Andrew Warfield. pages 183–190. 2007. 2007. France. In Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE 2007). Tim Harris. Flavio Junqueira. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003). Ian Pratt. Morgan & Claypool Publishers. pages 164–177. British Columbia. [10] Arthur Asuncion. Padhraic Smyth. [14] Michele Banko and Eric Brill. 2005. 2007. pages 26–33. [16] Luiz Andr´ Barroso. .158 CHAPTER 7. The Netherlands. Web search for a planet: The e o Google cluster architecture. and Max Welling. and Matei Zaharia. Vanessa Murdock. Istanbul. PageRank increase uno der different collusion topologies. Japan. pages 17–24. pages 6–20. Boris Dragovic. [18] Luiz Andr´ Barroso and Urs H¨lzle. Carlos Castillo. 2003. and Urs H¨lzle. Vassilis Plachouras. Asynchronous distributed learning of topic models. The Datacenter as a Computer: An Introduce o tion to the Design of Warehouse-Scale Machines. CLOSING REMARKS Stoica. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007). and Vicente L´pez. 2008. Technical Report UCB/EECS-2009-28. Bolton Landing. The impact of caching on search engines. Aristides Gionis. [12] Ricardo Baeza-Yates. IEEE Micro. 40(12):33–37. pages 81–88. Above the clouds: A Berkeley view of cloud computing. Carlos Castillo. Vassilis Plachouras. [17] Luiz Andr´ Barroso and Urs H¨lzle. [11] Ricardo Baeza-Yates. 2009. The case for energy-proportional computing. Chiba. Flavio Junqueira. Amsterdam. [15] Paul Barham. Steven Hand. University of California at Berkeley. Jeffrey Dean. New York. e o Computer. 23(2):22–28. In Advances in Neural Information Processing Systems 21 (NIPS 2008). 2003. Canada. Xen and the art of virtualization. Keir Fraser. Vancouver. 2001. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005). Toulouse. [13] Ricardo Baeza-Yates.

1995. Massachusetts. Prague. Morgan & Claypool Publishers. 2005. 1999. Berlin. Cambridge. Stochastic learning. Ghaleb Abdulla. Sergei Nikolaev. MAPREDUCE AND BEYOND 159 [19] Jacek Becla. California.3. Reading. Wang. Stanford Linear Accelerator Center. Charles L. [21] Gordon Bell. In Proceedings of the Tenth Text REtrieval Conference (TREC 2001). Massachusetts. Tony Hey. Susan Dumais. Cormack. MIT Press. Computational Linguistics. Inside PageRank. Mercer. and Gordon V. Och. Collected Fictions (translated by Andrew Hurley). Peng Xu. Alex Szalay. Maryland. Vincent J. 19(2):263–311. 323(5919):1297–1298. Penguin. pages 393–400. Clarke. 2005.7. Science. Della Pietra. Maria Nieto-Santisteban. 2004. 2001. Ashok C. and Jeffrey Dean. A. Czech Republic. [26] Thorsten Brants and Peng Xu. pages 858–867. pages 146–168. Jimmy Lin. Dataintensive question answering. Beyond the data deluge. Gaithersburg. Advanced Lectures on Machine Learning. In Proceedings of the Second Biennial Conference on Innovative Data Systems Research (CIDR 2005). Addison-Wesley. Ani Thakar. [25] Thorsten Brants. Della Pietra. 2010. The Mythical Man-Month: Essays on Software Engineering. SLAC Publications SLAC-PUB-12292. [22] Monica Bianchini. Stephen A. Brown. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Popat. ACM Transactions on Internet Technology. May 2006. e editors. and Andrew Ng. [24] L´on Bottou. Lecture Notes in Artificial Intelligence. In Olivier Bousquet and Ulrike von Luxburg. . 2010. Brooks. Anniversary Edition. [23] Jorge Luis Borges. and Franco Scarselli. and Alex Szalay. Franz J. Information u Retrieval: Implementing and Evaluating Search Engines. [28] Frederick P. 2007. 5(1):92–128. Distributed Language Models. [30] Stefan B¨ttcher. Michele Banko. [27] Eric Brill. 1993. Designing a multi-petabyte database for LSST. Lessons learned from managing a petabyte. [20] Jacek Becla and Daniel L. The mathematics of statistical machine translation: Parameter estimation. Marco Gori. Springer Verlag. 2009. LNAI 3176. Asilomar. Andrew Hanushevsky. and Jim Gray. [29] Peter F. and Robert L. Large language models in machine translation.

Computing in Science and Engineering. Wilson C. Sanjay Ghemawat. Philipp Koehn. [39] Jonathan Cohen. and Robert Gruber. Indiana. Raghu Ramakrishnan. [34] Fay Chang. pages 281–288. James Broberg. Ruey-Lung Hsiao. Jeffrey Dean. 2009. Canada. Computational Linguistics. Church and Patrick Hanks. Andrew Fikes.160 CHAPTER 7. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL 1996). In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. 2009. Cloud computing and emerging IT platforms: Vision. 2007. pages 1–28. 4(4):405–436. Srikumar Venugopal. and Josh Schroeder. 2006. Benchmarking cloud serving systems with YCSB. . British Columbia. In Advances in Neural Information Processing Systems 19 (NIPS 2006). Stott Parker. Wallach. [38] Kenneth W. E. YuanYuan Yu. Computer Systems. mutual information. 25(6):599–616. Michael Burrows. Adam Silberstein. Map-Reduce for machine learning on multicore. Sang Kyun Kim. 1996. Gary Bradski. Santa Cruz. [36] Hung chih Yang. and D. Seattle. 11(4):29–41. [32] Luis-Felipe Cabrera and Darrell D. Yi-An Lin. Findings of the 2009 workshop on statistical machine translation. Long. Erwin Tam. and reality for delivering computing as the 5th utility. and lexicography. Vancouver. Indianapolis. [37] Cheng-Tao Chu. 2006. Graph twiddling in a MapReduce world. Washington. China. Andrew Ng. An empirical study of smoothing techniques for language modeling. MapReduce-Merge: Simplified relational data processing on large clusters. Beijing. and Ivona Brandic. [35] Stanley F. Future Generation Computer Systems. 2009. Tushar Chandra. pages 310–318. Deborah A. In Proceedings of the First ACM Symposium on Cloud Computing (ACM SOCC 2010). Ali Dasdan. 1990. Hsieh. Bigtable: A distributed storage system for structured data. Greece. hype. [40] Brian F. Athens. Chee Shin Yeo. In Proceedings of the Fourth Workshop on Statistical Machine Translation (StatMT ’09). In Proceedings of the 7th Symposium on Operating System Design and Implementation (OSDI 2006). pages 205–218. Swift: Using distributed disk striping to provide high I/O data rates. [33] Chris Callison-Burch. and Russell Sears. Chen and Joshua Goodman. and Kunle Olukotun. Christof Monz. 2010. pages 1029–1040. Word association norms. Cooper. California. CLOSING REMARKS [31] Rajkumar Buyya. 16(1):22–29. 1991.

Bruce Croft. [47] Jeffrey Dean and Sanjay Ghemawat. [43] David Culler. 2009. Klaus Erik Schauser. Trento. MIT Press. DeWitt. 1992. Eunice Santos. ACM SIGMOD Record. [44] Doug Cutting. Richard Karp. A practical part-of-speech tagger. ACM SIGPLAN Notices. Stonebraker. Randy H. Addison-Wesley. Cormen. Frank Olken. pages 133–140. Rivest. MapReduce: A flexible data processing tool. . Deniz Hastorun. Machine Learning. Multi-domain learning by confidence-weighted parameter combination. pages 137–150. [50] David J. and Thorsten von Eicken.7. Leonard D. Massachusetts. Communications of the ACM. [49] Arthur P. LogP: Towards a realistic model of parallel computation. Nan M. [48] Giuseppe DeCandia. Laird. [45] Jeffrey Dean and Sanjay Ghemawat. Communications of the ACM. Cambridge. Rubin. and Ronald L. Italy. 51(1):107–113. David Patterson. Dynamo: Amazon’s highly available key-value store. and Penelope Sibun. and Werner Vogels. Swami Sivasubramanian. 2008. 53(1):72–77. 1977. Gunavardhan Kakulapati. Charles E. [46] Jeffrey Dean and Sanjay Ghemawat. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP 2007). 2010. Dempster. Julian Kupiec. MapReduce: Simplified data processing on large clusters. Washington. Maximum likelihood from incomplete data via the EM algorithm. Parallel database systems: The future of high performance database systems. [51] David J. Massachusetts. 28(7):1–12. Avinash Lakshman. Alex Pilchin. 2007. pages 205–220. 79:123–149. In Proceedings of the Third Conference on Applied Natural Language Processing. Shapiro. MAPREDUCE AND BEYOND 161 [41] Thomas H. and Donald B. [42] W. Leiserson. Madan Jampani. 35(6):85–98. 1993. Peter Vosshall. San Francisco. [52] Mark Dredze.3. 2004. and Trevor Strohman. Jan Pedersen. Implementation techniques for main memory database systems. 1984. 2010. 1992. and David Wood. 39(1):1–38. Donald Meztler. DeWitt and Jim Gray. MapReduce: Simplified data processing on large clusters. Abhijit Sahay. Ramesh Subramonian. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004). 14(2):1–8. 1990. Journal of the Royal Statistical Society. Katz. Communications of the ACM. California. Introduction to Algorithms. Reading. Michael R. Stevenson. Alex Kulesza. Series B (Methodological). Search Engines: Information Retrieval in Practice. and Koby Crammer.

Aaron Cordova. Finland. CLOSING REMARKS [53] Susan Dumais. [62] Mark S. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003). and Jimmy Lin. The strength of weak ties: A network theory revisited. Chiba. Massachusetts. Proceedings of the National Academy of Science. George Karypis. 2003. Granovetter. 33(2):51–59. Reading. Tampere. Anshul Gupta. The Prague Bulletin of Mathematical Linguistics. Blackwell. 78(6):1360–1380. Web spam taxonomy. A synopsis of linguistic theory 1930–55. In Proceedings of the Third Workshop on Statistical Machine Translation at ACL 2008. Bolton Landing. Michele Banko. In Proceedings a o of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005). 2010. 93:37–46. pages 39–47. [54] Chris Dyer. [60] Ananth Grama. Alex Mont. Firth. 2005. [61] Mark S. 2002. Community structure in social and biological networks. Special Volume of the Philological Society. and Shun-Tak Leung. The American Journal of Sociology. The strength of weak ties. 2008. Ohio. Jimmy Lin. pages 199–207. Japan. pages 29–43. Fast. In Studies in Linguistic Analysis. [63] Zolt´n Gy¨ngyi and Hector Garcia-Molina. Granovetter. Newman. [59] Michelle Girvan and Mark E. 2003. 2002. [55] John R. Training phrase-based machine translation models on the cloud: Open source machine translation toolkit Chaski. and Vipin Kumar. Eric Brill. pages 1–32. Web question answering: Is more always better? In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002). easy. 1973. [58] Seth Gilbert and Nancy Lynch. 1957. The Google File System. Introduction to Parallel Computing. and Andrew Ng. Oxford. Addison-Wesley. Brewer’s Conjecture and the feasibility of consistent. ACM SIGACT News. [56] Qin Gao and Stephan Vogel. .162 CHAPTER 7. and cheap: Construction of statistical machine translation models with MapReduce. available. New York. pages 291–298. 1983. partition-tolerant web services. Howard Gobioff. 2002. Columbus. J. [57] Sanjay Ghemawat. Sociological Theory. 1:201–233. 99(12):7821– 7826.

2007. [74] John Howard. 24(2):8–12. Asilomar. Mathematical Structures of Language. Microsoft Research. Sebastopol. Washington. editors. Jim Gray on eScience: A transformed scientific method. Naga K. pages 260–269. and Kristin Tolle. David Nichols. 2009. The Fourth Paradigm: DataIntensive Scientific Discovery. Mars: A MapReduce framework on graphics processors. [69] Zelig S. Island Networks: Communication. ACM Transactions on Computer Systems. 2008. Beautiful Data. Washington. and Michael West. O’Reilly. 1996. 1968. Kinship. Sherri Menees. 2009. Dallas. Scale and performance in a distributed file system. Wiley. Redmond. 6(1):51–81. 2009. In Proceedings of the Fourth Biennial Conference on Innovative Data Systems Research (CIDR 2009). Stewart Tansley. On designing and deploying Internet-scale services. Toronto. Robert Sidebotham. [72] Tony Hey. 2009. and Classification Structures in Oceania. Microsoft Research. Stewart Tansley. England. Canada. Poland. The Fourth Paradigm: Data-Intensive Scientific Discovery. Information platforms and the rise of the data scientist. [71] Bingsheng He. Mahadev Satyanarayanan. Redmond. and Kristin Tolle. In Toby Segaran and Jeff Hammerbacher. In Proceedings of the 21st Large Installation System Administration Conference (LISA ’07). [67] James Hamilton. [65] Alon Halevy. low power servers for Internet-scale services. . and Kristin Tolle. MAPREDUCE AND BEYOND 163 [64] Per Hage and Frank Harary. New York. Cambridge University Press.7. Cambridge. The unreasonable effectiveness of data. pages 233–244. 2005. [68] Jeff Hammerbacher. California. Texas. In Tony Hey. and Fernando Pereira. Stock market forecasting using hidden Markov models: A new approach. 1988. Cooperative Expendable Micro-Slice Servers (CEMS): Low cost. pages 192–196. Harris. Qiong Luo. [70] Md. California. [66] James Hamilton. Rafiul Hassan and Baikunth Nath. Govindaraju. pages 73–84. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT 2008). Michael Kazar. Peter Norvig.3. Ontario. Wroclaw. [73] Tony Hey. and Tuyong Wang. editors. Communications of the ACM. Stewart Tansley. Wenbin Fang. 2009. In Proceedings of the 5th International Conference on Intelligent Systems Design and Applications (ISDA ’05).

Austin. Cambridge University Press. Statistical Machine Translation. Mihai Budiu. Oregon. Authoritative sources in a hyperlinked environment. 2007. Cambridge. . Pearson. An Introduction to Parallel Algorithms. Tsourakakis. Sierra Michels-Slettvet. [76] Adam Jacobs. [81] U Kang. Statistical phrase-based translation. pages 229–238. 2009. [83] Aaron Kimball. 2008. pages 116–120. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining (ICDM 2009). and Daniel Marcu. [86] Philipp Koehn. Journal of the ACM. New Jersey. Massachusetts. 2008. Upper Saddle River. Christos Faloutsos. Technical Report CMU-ML-08-117. [84] Jon M. PEGASUS: A peta-scale graph mining system—implementation and observations. [82] Howard Karloff. Andrew Birrell. 2010. Ana Paula Appel. Siddharth Suri. [77] Joseph JaJa. 46(5):604–632. Miami. Edmonton. Texas. [85] Philipp Koehn. In Proceedings of the 39th ACM Technical Symposium on Computer Science Education (SIGCSE 2008). 1999. and Christos Faloutsos. Och. HADI: Fast diameter estimation and mining in massive graphs with Hadoop. Canada. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2003). Massachusetts. Cambridge. 2003. 2009. MIT Press. CLOSING REMARKS [75] Michael Isard. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2010).164 CHAPTER 7. and Christophe Bisciglia. Lisbon. Carnegie Mellon University. Portugal. 1997. Statistical methods for speech recognition. England. Addison-Wesley. Martin. Cluster computing for Web-scale data processing. Franz J. ACM Queue. Portland. The pathologies of big data. Charalampos E. Speech and Language Processing. School of Computer Science. pages 59–72. Floria. Reading. 7(6). [78] Frederick Jelinek. Alberta. and Sergei Vassilvitskii. A model of computation for MapReduce. 2010. 2009. [79] Daniel Jurafsky and James H. 1992. Kleinberg. Charalampos Tsourakakis. and Jure Leskovec. [80] U Kang. and Dennis Fetterly. pages 48–54. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (EuroSys 2007). Yuan Yu.

Dong C. Exploring large-data issues in the curriculum: A case study with MapReduce. Jorge Nocedal. pages 54–61. pages 756–764. On the limited memory BFGS method for large scale optimization. Anand Bahety.3. [94] Jimmy Lin. pages 419–428. ACM Transactions on Information Systems.7. pages 282–289. [95] Jimmy Lin. [90] Abby Levenberg and Miles Osborne. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML ’01). Honolulu. 2008. [93] Jimmy Lin. [88] Ronny Lempel and Shlomo Moran. California. Ohio. and Miles Osborne. 2001. 2007. 40(3):1– 49. [96] Dong C. Chris Callison-Burch. SALSA: The Stochastic Approach for LinkStructure Analysis. Stream-based translation models for statistical machine translation. January 2009. In Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010). MAPREDUCE AND BEYOND 165 [87] John D. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics (TeachCL-08) at ACL 2008. [97] Adam Lopez. 1989. 2008. Liu. 2008. University of Maryland. . ACM Queue. Technical Report HCIL-2009-01. 2010. Maryland. ACM Transactions on Information Systems. 7(11). An exploration of the principles underlying redundancy-based factoid question answering. 2001. [92] Jimmy Lin. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Stream-based randomised language models for SMT. Hawaii. 2009. 27(2):1–55. Scalable language processing algorithms for the masses: A case study in computing word co-occurrence matrices with MapReduce. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008). Triple-parity RAID and beyond. 45(3):503–528. Statistical machine translation. and Samantha Mahindrakar. [89] Abby Levenberg. Shravya Konda. 2009. Liu. [91] Adam Leventhal. Los Angeles. Singapore. California. College Park. 19(2):131–160. and Fernando Pereira. and Jorge Nocedal. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. San Francisco. Columbus. Mathematical Programming B. Lafferty. Lowlatency. ACM Computing Surveys. high-throughput access to static global resources within the Hadoop framework. Andrew McCallum.

Calgary. pages 219–226. Alberta. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. Hang Cui. McCool. ACM Queue. Foundations of Statistical Natural u Language Processing. Mardis. Austern. Berkeley. 2009. A hidden Markov model information retrieval system. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009). Improving memory hierarchy performance for irregular applications using data and computation reorderings. 2008. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999). [107] Donald Metzler. Bik. and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. Naty Leiser. James C. 7(7). and Ken Kennedy. 2010. An Introducu tion to Information Retrieval. The impact of next-generation sequencing technology on genetics. [104] Michael D. Scalable programming models for massively multicore processors. GFS: Evolution on fast-forward. Dehnert. Schwartz. Cambridge. 1999. page 6. 2009. Dehnert. C. . Taipei. CLOSING REMARKS [98] Grzegorz Malewicz. [105] Marshall K. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002). Massachusetts. Building enriched document representations using aggregated anchor text. Aart J. Miller. 2009. A comparison of algorithms for maximum entropy parameter estimation. 29(3):217–247. Proceedings of the IEEE. Naty Leiser. Cambridge University Press. Canada. C. Ilan Horn. 2008. Indiana. and Hinrich Sch¨tze. Matthew H. McKusick and Sean Quinlan. California. 2008. 24(3):133–141. Indianapolis. pages 214–221. 1999. [100] Robert Malouf. England. [108] David R. Ilan Horn. MIT Press. 2001. Manning and Hinrich Sch¨tze. Aart J. Pregel: A system for largescale graph processing. [102] Christopher D. Jasmine Novak. David Whalley. 96(5):816–831. Matthew H. and Srihari Reddy. In Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (PODC 2009). Tim Leek. 2002. Trends in Genetics. [106] John Mellor-Crummey. H. Taiwan. Prabhakar Raghavan. [101] Christopher D.166 CHAPTER 7. and Grzegorz Czajkowski. Manning. Cambridge. pages 49–55. James C. Bik. [99] Grzegorz Malewicz. [103] Elaine R. Austern. International Journal of Parallel Programming. and Richard M.

2003. D. Daniel J. [111] Daniel Nurmi. A comparison of approaches to . Opinion mining and sentiment analysis. Using maximum entropy for text classification. Mahout in Action. John Lafferty. The PageRank citation ranking: Bringing order to the Web. and Andrew McCallum. Washington. The Eucalyptus open-source cloudcomputing system. 2008. Sweden. pages 1099–1110. Utkarsh Srivastava. and Andrew Tomkins. ACM Queue. and Terry Winograd.C. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006). Washington.7. Samuel Madden. Seattle. [118] Bo Pang and Lillian Lee. and Michael Stonebraker. 29(1):19–51. Foundations and Trends in Information Retrieval. [119] David A. 2010. pages 348–355. and Justin Zobel. Och and Hermann Ney. Pig Latin: A not-so-foreign language for data processing. [112] Franz J. Stockholm. A systematic comparison of various statistical alignment models. [117] Lawrence Page. Erik Paulson. British Columbia. Alexander Rasin. William Webber. Sunil Soman. [114] Christopher Olston.. The future of microprocessors. [116] Sean Owen and Robin Anil. Chris Grzegorczyk. pages 61–67. Load balancing for termdistributed parallel retrieval. Vancouver. Web crawling. 52(1):105. 2008. [115] Kunle Olukotun and Lance Hammond. [120] Andrew Pavlo. Graziano Obertelli. The data center is the computer. Benjamin Reed. [113] Christopher Olston and Marc Najork. 2009. Ravi Kumar. 4(3):175–246. Canada. Greenwich. Stanford University. Communications of the ACM. 2(1–2):1–135. Rajeev Motwani. Connecticut. MAPREDUCE AND BEYOND 167 [109] Alistair Moffat. 1999. Foundations and Trends in Information Retrieval. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. Abadi. pages 124–131. [110] Kamal Nigam. and Dmitrii Zagorodnov. Computational Linguistics. Sergey Brin. Manning Publications Co. 2010. 1999. 2008. Lamia Youseff. DeWitt. Stanford Digital Library Working Paper SIDL-WP-1999-0120. Patterson. David J. 3(7):27–34. In Proceedings of the IJCAI-99 Workshop on Machine Learning for Information Filtering. In Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid. 2005.. 2006. Rich Wolski.3.

Supporting MapReduce on large-scale asymmetric multi-core clusters. [124] Xiaoguang Qi and Brian D. Wolf-Dietrich Weber. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 2007). Davison. Arizona. and Christos Kozyrakis. In Proceedings of the 35th ACM SIGMOD International Conference on Management of Data. Ramanan Raghuraman. Wiley. 2004. California. ACM Operating Systems Review. 1990. Singapore. . Benjamin Rose. Providence. [130] Sheldon M. Rappa. [122] Rob Pike. Robert Griesemer. California. pages 267–296. California. Stochastic processes. Butt. 1996. 2007. Mustafa Rafique. IBM Systems Journal. ACM Computing Surveys. San Jose. CLOSING REMARKS large-scale data analysis. New York. and Sean Quinlan. 2009. Evaluating MapReduce for multi-core and multiprocessor systems. 2005. Ross. Interpreting the data: Parallel analysis with Sawzall. Nikolopoulos. [121] Sasa Petrovic. In Proceedings of the ACL/IJCNLP 2009 Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-4). and Victor Lavrenko. 13(4):277– 298. 2009. Ranking and semi-supervised classification on large scale graphs using Map-Reduce. Web page classification: Features and algorithms. 2009. Miles Osborne. Rabiner. Ali R. and Luiz Andr´ Barroso. [126] M. In Readings in Speech Recognition.168 CHAPTER 7. Streaming first story detection with application to Twitter. [125] Lawrence R. [129] Michael A. Arun Penmetsa. In Proceedings of the 13th International Symposium on High-Performance Computer Architecture (HPCA 2007). pages 205–218. Phoenix. The utility business model and the future of computing services. Rhode Island. Los Angeles. 2008. pages 165–178. Failure trends e in a large disk drive population. Gary Bradski. [123] Eduardo Pinheiro. Sean Dorward. San Francisco. 43(2):25–34. Scientific Programming Journal. 2009. In Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010). 41(2). and Dimitrios S. [128] Delip Rao and David Yarowsky. [127] Colby Ranger. 2010. Morgan Kaufmann Publishers. A tutorial on hidden Markov models and selected applications in speech recognition. 34(1):32–42.

pages 37–42. A cooccurrence-based thesaurus and two u applications to information retrieval. [136] Hinrich Sch¨tze. pages 193–204. Classification and Use.tut04.3. pages 134–141. Seattle. and Ronald Rosenfeld. Monterey. Named Entities: Recognition. University of Maryland. Andrew Mccallum. 1998.pdf. pages 110–121. 2004. In Proceedings of the First USENIX Conference on File and Storage Technologies. Amsterdam. Canada. John Benjamins. Portland. The Netherlands. Alberta. Shallow parsing with conditional random fields. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data. DeWitt. Log-linear models. http://www. Seattle. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’09). Information Processing and Management. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’09). Florida. PhD thesis. Washington. . 2009. Edmonton. 1989. MAPREDUCE AND BEYOND 169 [131] Thomas Sandholm and Kevin Lai. Pedersen.cs. 2009. 33(3):307–318. MapReduce optimization using regulated dynamic prioritization. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2003). [140] Fei Sha and Fernando Pereira. Orlando. Automatic word sense discrimination. Schneider and David J. [139] Kristie Seymore. [133] Frank Schmuck and Roger Haskin. [138] Satoshi Sekine and Elisabete Ranchhod. A performance evaluation of four parallel join algorithms in a shared-noth