
Data-Intensive Text Processing with MapReduce

Synthesis Lectures on Human

Language Technologies

Editor

Graeme Hirst, University of Toronto

Synthesis Lectures on Human Language Technologies is edited by Graeme Hirst of the University

of Toronto. The series consists of 50- to 150-page monographs on topics relating to natural

language processing, computational linguistics, information retrieval, and spoken language

understanding. Emphasis is on important new techniques, on new applications, and on topics that

combine two or more HLT subfields.

Data-Intensive Text Processing with MapReduce

Jimmy Lin and Chris Dyer

2010

Semantic Role Labeling

Martha Palmer, Daniel Gildea, and Nianwen Xue

2010

Spoken Dialogue Systems

Kristiina Jokinen and Michael McTear

2009

Introduction to Chinese Natural Language Processing

Kam-Fai Wong, Wenjie Li, Ruifeng Xu, and Zheng-sheng Zhang

2009

Introduction to Linguistic Annotation and Text Analytics

Graham Wilcock

2009

Dependency Parsing

Sandra Kübler, Ryan McDonald, and Joakim Nivre

2009

Statistical Language Models for Information Retrieval

ChengXiang Zhai

2008

Copyright © 2010 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in

any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in

printed reviews, without the prior permission of the publisher.

Data-Intensive Text Processing with MapReduce

Jimmy Lin and Chris Dyer

www.morganclaypool.com

ISBN: 9781608453429 paperback

ISBN: 9781608453436 ebook

DOI 10.2200/S00274ED1V01Y201006HLT007

A Publication in the Morgan & Claypool Publishers series

SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES

Lecture #7

Series Editor: Graeme Hirst, University of Toronto

Series ISSN

Synthesis Lectures on Human Language Technologies

Print 1947-4040 Electronic 1947-4059

Data-Intensive Text Processing

with MapReduce

Jimmy Lin and Chris Dyer

University of Maryland

SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES #7

Morgan & Claypool Publishers

ABSTRACT

Our world is being revolutionized by data-driven methods: access to large amounts of data has gen-

erated new insights and opened exciting new opportunities in commerce, science, and computing

applications. Processing the enormous quantities of data necessary for these advances requires large

clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming

model for expressing distributed computations on massive datasets and an execution framework

for large-scale data processing on clusters of commodity servers. The programming model provides

an easy-to-understand abstraction for designing scalable algorithms, while the execution framework

transparently handles many system-level details, ranging from scheduling to synchronization to fault

tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing

algorithms common in natural language processing, information retrieval, and machine learning.

We introduce the notion of MapReduce design patterns, which represent general reusable solutions

to commonly occurring problems across a variety of problem domains. This book not only intends

to help the reader “think in MapReduce”, but also discusses limitations of the programming model.

KEYWORDS

Hadoop, parallel and distributed programming, algorithm design, text processing, nat-

ural language processing, information retrieval, machine learning


Contents

Acknowledgments

1 Introduction
    1.1 Computing in the Clouds
    1.2 Big Ideas
    1.3 Why Is This Different?
    1.4 What This Book Is Not

2 MapReduce Basics
    2.1 Functional Programming Roots
    2.2 Mappers and Reducers
    2.3 The Execution Framework
    2.4 Partitioners and Combiners
    2.5 The Distributed File System
    2.6 Hadoop Cluster Architecture
    2.7 Summary

3 MapReduce Algorithm Design
    3.1 Local Aggregation
        3.1.1 Combiners and In-Mapper Combining
        3.1.2 Algorithmic Correctness with Local Aggregation
    3.2 Pairs and Stripes
    3.3 Computing Relative Frequencies
    3.4 Secondary Sorting
    3.5 Relational Joins
        3.5.1 Reduce-Side Join
        3.5.2 Map-Side Join
        3.5.3 Memory-Backed Join
    3.6 Summary

4 Inverted Indexing for Text Retrieval
    4.1 Web Crawling
    4.2 Inverted Indexes
    4.3 Inverted Indexing: Baseline Implementation
    4.4 Inverted Indexing: Revised Implementation
    4.5 Index Compression
        4.5.1 Byte-Aligned and Word-Aligned Codes
        4.5.2 Bit-Aligned Codes
        4.5.3 Postings Compression
    4.6 What About Retrieval?
    4.7 Summary and Additional Readings

5 Graph Algorithms
    5.1 Graph Representations
    5.2 Parallel Breadth-First Search
    5.3 PageRank
    5.4 Issues with Graph Processing
    5.5 Summary and Additional Readings

6 EM Algorithms for Text Processing
    6.1 Expectation Maximization
        6.1.1 Maximum Likelihood Estimation
        6.1.2 A Latent Variable Marble Game
        6.1.3 MLE with Latent Variables
        6.1.4 Expectation Maximization
        6.1.5 An EM Example
    6.2 Hidden Markov Models
        6.2.1 Three Questions for Hidden Markov Models
        6.2.2 The Forward Algorithm
        6.2.3 The Viterbi Algorithm
        6.2.4 Parameter Estimation for HMMs
        6.2.5 Forward-Backward Training: Summary
    6.3 EM in MapReduce
        6.3.1 HMM Training in MapReduce
    6.4 Case Study: Word Alignment for Statistical Machine Translation
        6.4.1 Statistical Phrase-Based Translation
        6.4.2 Brief Digression: Language Modeling with MapReduce
        6.4.3 Word Alignment
        6.4.4 Experiments
    6.5 EM-Like Algorithms
        6.5.1 Gradient-Based Optimization and Log-Linear Models
    6.6 Summary and Additional Readings

7 Closing Remarks
    7.1 Limitations of MapReduce
    7.2 Alternative Computing Paradigms
    7.3 MapReduce and Beyond

Bibliography

Authors’ Biographies

Acknowledgments

The ﬁrst author is grateful to Esther and Kiri for their loving support. He dedicates this book

to Joshua and Jacob, the new joys of his life.

The second author would like to thank Herb for putting up with his disorderly living habits

and Philip for being a very indulgent linguistics advisor.

This work was made possible by the Google and IBM Academic Cloud Computing Initiative

(ACCI) and the National Science Foundation’s Cluster Exploratory (CLuE) program, under award

IIS-0836560, and also award IIS-0916043. Any opinions, findings, conclusions, or recommendations

expressed in this book are those of the authors and do not necessarily reﬂect the views of the sponsors.

We are grateful to Jeff Dean, Miles Osborne, Tom White, as well as numerous other individuals

who have commented on earlier drafts of this book.

Jimmy Lin and Chris Dyer

May 2010

CHAPTER 1

Introduction

MapReduce [45] is a programming model for expressing distributed computations on massive

amounts of data and an execution framework for large-scale data processing on clusters of com-

modity servers. It was originally developed by Google and built on well-known principles in parallel

and distributed processing dating back several decades. MapReduce has since enjoyed widespread

adoption via an open-source implementation called Hadoop, whose development was led by Yahoo

(now an Apache project). Today, a vibrant software ecosystem has sprung up around Hadoop, with

signiﬁcant activity in both industry and academia.

This book is about scalable approaches to processing large amounts of text with MapReduce.

Given this focus, it makes sense to start with the most basic question: Why? There are many answers

to this question, but we focus on two. First, “big data” is a fact of the world, and therefore an issue that

real-world systems must grapple with. Second, across a wide range of text processing applications,

more data translates into more effective algorithms, and thus it makes sense to take advantage of the

plentiful amounts of data that surround us.

Modern information societies are deﬁned by vast repositories of data, both public and private.

Therefore, any practical application must be able to scale up to datasets of interest. For many, this

means scaling up to the web, or at least a non-trivial fraction thereof. Any organization built around

gathering, analyzing, monitoring, ﬁltering, searching, or organizing web content must tackle large-

data problems: “web-scale” processing is practically synonymous with data-intensive processing. This

observation applies not only to well-established internet companies, but also to countless startups and

niche players. Just think, how many companies do you know that start their pitch with “we’re

going to harvest information on the web and…”?

Another strong area of growth is the analysis of user behavior data. Any operator of a moder-

ately successful website can record user activity and in a matter of weeks (or sooner) be drowning in

a torrent of log data. In fact, logging user behavior generates so much data that many organizations

simply can’t cope with the volume, and either turn the functionality off or throw away data after

some time. This represents lost opportunities, as there is a broadly held belief that great value lies in

insights derived from mining such data. Knowing what users look at, what they click on, how much

time they spend on a web page, etc., leads to better business decisions and competitive advantages.

Broadly, this is known as business intelligence, which encompasses a wide range of technologies

including data warehousing, data mining, and analytics.

How much data are we talking about? A few examples: Google grew from processing 100 terabytes

of data a day with MapReduce in 2004 [45] to processing 20 petabytes a day with MapReduce


in 2008 [46]. In April 2009, a blog post¹ was written about eBay’s two enormous data warehouses:

one with 2 petabytes of user data, and the other with 6.5 petabytes of user data spanning 170 trillion

records and growing by 150 billion new records per day. Shortly thereafter, Facebook revealed²

similarly impressive numbers, boasting of 2.5 petabytes of user data, growing at about 15 terabytes

per day. Petabyte datasets are rapidly becoming the norm, and the trends are clear: our ability to

store data is fast overwhelming our ability to process what we store. More distressing, increases in

capacity are outpacing improvements in bandwidth such that our ability to even read back what we

store is deteriorating [91]. Disk capacities have grown from tens of megabytes in the mid-1980s

to about a couple of terabytes today (several orders of magnitude). On the other hand, latency and

bandwidth have improved relatively little: in the case of latency, perhaps 2× improvement during the

last quarter century, and in the case of bandwidth, perhaps 50×. Given the tendency for individuals

and organizations to continuously ﬁll up whatever capacity is available, large-data problems are

growing increasingly severe.

Moving beyond the commercial sphere, many have recognized the importance of data man-

agement in many scientiﬁc disciplines, where petabyte-scale datasets are also becoming increasingly

common [21]. For example:

• The high-energy physics community was already describing experiences with petabyte-scale

databases back in 2005 [20]. Today, the Large Hadron Collider (LHC) near Geneva is the

world’s largest particle accelerator, designed to probe the mysteries of the universe, including

the fundamental nature of matter, by recreating conditions shortly following the Big Bang.

When it becomes fully operational, the LHC will produce roughly 15 petabytes of data a year.³

• Astronomers have long recognized the importance of a “digital observatory” that would support

the data needs of researchers across the globe—the Sloan Digital Sky Survey [145] is perhaps

the most well known of these projects. Looking into the future, the Large Synoptic Survey

Telescope (LSST) is a wide-ﬁeld instrument that is capable of observing the entire sky every

few days. When the telescope comes online around 2015 in Chile, its 3.2 gigapixel primary

camera will produce approximately half a petabyte of archive images every month [19].

• The advent of next-generation DNA sequencing technology has created a deluge of sequence

data that needs to be stored, organized, and delivered to scientists for further study. Given

the fundamental tenet in modern genetics that genotypes explain phenotypes, the impact

of this technology is nothing less than transformative [103]. The European Bioinformatics

Institute (EBI), which hosts a central repository of sequence data called EMBL-bank, has

increased storage capacity from 2.5 petabytes in 2008 to 5 petabytes in 2009 [142]. Scientists

are predicting that, in the not-so-distant future, sequencing an individual’s genome will be no

more complex than getting a blood test today, ushering in a new era of personalized medicine,

where interventions can be speciﬁcally targeted for an individual.

¹ http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/

² http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/

³ http://public.web.cern.ch/public/en/LHC/Computing-en.html


Increasingly, scientiﬁc breakthroughs will be powered by advanced computing capabilities that help

researchers manipulate, explore, and mine massive datasets [72]—this has been hailed as the emerg-

ing “fourth paradigm” of science [73] (complementing theory, experiments, and simulations). In

other areas of academia, particularly computer science, systems and algorithms incapable of scaling

to massive real-world datasets run the danger of being dismissed as “toy systems” with limited utility.

Large data is a fact of today’s world and data-intensive processing is fast becoming a necessity, not

merely a luxury or curiosity.

Although large data comes in a variety of forms, this book is primarily concerned with pro-

cessing large amounts of text, but touches on other types of data as well (e.g., relational and graph

data). The problems and solutions we discuss mostly fall into the disciplinary boundaries of natural

language processing (NLP) and information retrieval (IR). Recent work in these ﬁelds is dominated

by a data-driven, empirical approach, typically involving algorithms that attempt to capture statistical

regularities in data for the purposes of some task or application. There are three components to this

approach: data, representations of the data, and some method for capturing regularities in the data.

Data are called corpora (singular, corpus) by NLP researchers and collections by those from the IR

community. Aspects of the representations of the data are called features, which may be “superﬁcial”

and easy to extract, such as the words and sequences of words themselves, or “deep” and more difﬁcult

to extract, such as the grammatical relationship between words. Finally, algorithms or models are

applied to capture regularities in the data in terms of the extracted features for some application. One

common application, classiﬁcation, is to sort text into categories. Examples include: Is this email

spam or not spam? Is this word part of an address or a location? The ﬁrst task is easy to understand,

while the second task is an instance of what NLP researchers call named-entity detection [138],

which is useful for local search and pinpointing locations on maps. Another common application is to

rank texts according to some criteria—search is a good example, which involves ranking documents

by relevance to the user’s query. Another example is to automatically situate texts along a scale of

“happiness”, a task known as sentiment analysis or opinion mining [118], which has been applied to

everything from understanding political discourse in the blogosphere to predicting the movement

of stock prices.

There is a growing body of evidence, at least in text processing, that of the three components

discussed above (data, features, algorithms), data probably matters the most. Superﬁcial word-level

features coupled with simple models in most cases trump sophisticated models with deeper features

and less data. But why can’t we have our cake and eat it too? Why not both sophisticated models and

deep features applied to lots of data? Because inference over sophisticated models and extraction of

deep features are often computationally intensive, they don’t scale well.

Consider a simple task such as determining the correct usage of easily confusable words such

as “than” and “then” in English. One can view this as a supervised machine learning problem: we

can train a classiﬁer to disambiguate between the options, and then apply the classiﬁer to new

instances of the problem (say, as part of a grammar checker). Training data is fairly easy to come

by—we can just gather a large corpus of texts and assume that most writers make correct choices


(the training data may be noisy, since people make mistakes, but no matter). In 2001, Banko and

Brill [14] published what has become a classic paper in natural language processing exploring the

effects of training data size on classiﬁcation accuracy, using this task as the speciﬁc example. They

explored several classiﬁcation algorithms (the exact ones aren’t important, as we shall see), and not

surprisingly, found that more data led to better accuracy. Across many different algorithms, the

increase in accuracy was approximately linear in the log of the size of the training data. Furthermore,

with increasing amounts of training data, the accuracy of different algorithms converged, such that

pronounced differences in effectiveness observed on smaller datasets basically disappeared at scale.

This led to a somewhat controversial conclusion (at least at the time): machine learning algorithms

really don’t matter, all that matters is the amount of data you have. This resulted in an even more

controversial recommendation, delivered somewhat tongue-in-cheek: we should just give up working

on algorithms and simply spend our time gathering data (while waiting for computers to become

faster so we can process the data).
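To make the supervised setup above concrete, here is a minimal Python sketch of how labeled training instances for a confusion set such as “than”/“then” might be harvested from raw text, treating each writer's choice as the label. The tokenizer, the neighboring-word features, and the vote-counting classifier are illustrative assumptions, not the algorithms Banko and Brill evaluated.

```python
import re
from collections import Counter, defaultdict

CONFUSABLE = {"than", "then"}  # the confusion set discussed in the text


def tokenize(text):
    """Lowercase and split into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def extract_instances(text):
    """Yield (features, label) pairs, one per occurrence of a confusable word.

    The label is the word the writer actually chose. The features are the
    neighboring words, an assumption made purely for illustration.
    """
    tokens = tokenize(text)
    for i, tok in enumerate(tokens):
        if tok in CONFUSABLE:
            prev_word = tokens[i - 1] if i > 0 else "<s>"
            next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
            yield (("prev=" + prev_word, "next=" + next_word), tok)


def train(corpus):
    """Count how often each feature co-occurs with each label."""
    counts = defaultdict(Counter)
    for text in corpus:
        for features, label in extract_instances(text):
            for feature in features:
                counts[feature][label] += 1
    return counts


def predict(counts, prev_word, next_word):
    """Pick the label with the most feature votes (ties broken alphabetically)."""
    votes = Counter()
    for feature in ("prev=" + prev_word, "next=" + next_word):
        votes.update(counts.get(feature, {}))
    if not votes:
        return None  # no evidence at all for this context
    return min(votes, key=lambda label: (-votes[label], label))


corpus = [
    "She is taller than her brother.",
    "We ate dinner and then went home.",
    "It was more expensive than expected.",
    "First measure, then cut.",
]
model = train(corpus)
print(predict(model, "taller", "me"))  # context resembling "taller ___ me"
print(predict(model, "and", "left"))   # context resembling "and ___ left"
```

With only four sentences the model has seen very few contexts, which is the point of the passage: each additional document contributes more labeled instances at no annotation cost.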

As another example, consider the problem of answering short, fact-based questions such as

“Who shot Abraham Lincoln?” Instead of returning a list of documents that the user would then have

to sort through, a question answering (QA) system would directly return the answer: John Wilkes

Booth. This problem gained interest in the late 1990s, when natural language processing researchers

approached the challenge with sophisticated linguistic processing techniques such as syntactic and

semantic analysis. Around 2001, researchers discovered a far simpler approach to answering such

questions based on pattern matching [27; 53; 92]. Suppose you wanted the answer to the above

question. As it turns out, you can simply search for the phrase “shot Abraham Lincoln” on the web

and look for what appears to its left. Or better yet, look through multiple instances of this phrase

and tally up the words that appear to the left. This simple strategy works surprisingly well, and has

become known as the redundancy-based approach to question answering. It capitalizes on the insight

that in a very large text collection (i.e., the web), answers to commonly asked questions will be stated

in obvious ways, such that pattern-matching techniques sufﬁce to extract answers accurately.
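The tallying strategy just described can be sketched in a few lines of Python. The snippets below are made-up stand-ins for text returned by a web search, and the function name and the `window` parameter are hypothetical; the published systems cited above used more elaborate candidate extraction and filtering.

```python
import re
from collections import Counter


def tally_left_words(snippets, phrase, window=3):
    """Count the words that appear within `window` tokens to the left of `phrase`."""
    phrase_tokens = re.findall(r"\w+", phrase.lower())
    n = len(phrase_tokens)
    tally = Counter()
    for snippet in snippets:
        tokens = re.findall(r"\w+", snippet.lower())
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == phrase_tokens:
                # Every word in the left window is a candidate answer word.
                tally.update(tokens[max(0, i - window):i])
    return tally


# Invented snippets for illustration only.
snippets = [
    "John Wilkes Booth shot Abraham Lincoln at Ford's Theatre.",
    "In 1865 Booth shot Abraham Lincoln.",
    "The actor John Wilkes Booth shot Abraham Lincoln in Washington.",
]
tally = tally_left_words(snippets, "shot Abraham Lincoln")
print(tally.most_common(1))  # [('booth', 3)]
```

Words that belong to the answer recur across snippets while incidental words such as “in” or “1865” do not, so the counts separate the two more sharply as the collection grows.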

Yet another example concerns smoothing in web-scale language models [25]. A language

model is a probability distribution that characterizes the likelihood of observing a particular sequence

of words, estimated from a large corpus of texts. Language models are useful in a variety of applications, such as

speech recognition (to determine what the speaker is more likely to have said) and machine translation

(to determine which of possible translations is the most ﬂuent, as we will discuss in Section 6.4). Since

there are inﬁnitely many possible strings, and probabilities must be assigned to all of them, language

modeling is a more challenging task than simply keeping track of which strings were seen how many

times: some number of likely strings will never be encountered, even with lots and lots of training

data! Most modern language models make the Markov assumption: in an n-gram language model,

the conditional probability of a word depends only on the n − 1 previous words. Thus, by the chain rule,

the probability of a sequence of words can be decomposed into the product of n-gram probabilities.

Nevertheless, an enormous number of parameters must still be estimated from a training corpus:

potentially V^n parameters, where V is the number of words in the vocabulary. Even if we treat


every word on the web as the training corpus from which to estimate the n-gram probabilities, most

n-grams—in any language, even English—will never have been seen. To cope with this sparseness,

researchers have developed a number of smoothing techniques [35; 79; 102], which all share the basic

idea of moving probability mass from observed to unseen events in a principled manner. Smoothing

approaches vary in effectiveness, both in terms of intrinsic and application-speciﬁc metrics. In 2007,

Brants et al. [25] described language models trained on up to two trillion words.⁴ Their experiments

compared a state-of-the-art approach known as Kneser-Ney smoothing [35] with another technique

the authors affectionately referred to as “stupid backoff”.⁵ Not surprisingly, stupid backoff didn’t work

as well as Kneser-Ney smoothing on smaller corpora. However, it was simpler and could be trained

on more data, which ultimately yielded better language models. That is, a simpler technique on more

data beat a more sophisticated technique on less data.
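As a rough illustration of why the simpler technique is easy to scale, the following Python sketch counts n-grams and scores a word with a backoff scheme in the spirit of stupid backoff: use the relative frequency of the longest observed n-gram, and multiply by a fixed factor each time the context has to be shortened. The factor 0.4 is the value we recall from Brants et al. [25] and should be treated as an assumption here, as should the function names and the toy corpus. The scores are not normalized probabilities.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor; assumed here, recalled from Brants et al.


def count_ngrams(sentences, max_order):
    """Count all n-grams of order 1..max_order over tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(word, context, counts, total_tokens):
    """Score `word` given a tuple of preceding words.

    Returns the relative frequency of the longest observed n-gram, scaled by
    ALPHA once for every word dropped from the context. The result is a
    score, not a normalized probability.
    """
    penalty = 1.0
    context = tuple(context)
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return penalty * counts[ngram] / counts[context]
        context = context[1:]  # drop the most distant word and back off
        penalty *= ALPHA
    return penalty * counts[(word,)] / total_tokens


sentences = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "sat"]]
counts = count_ngrams(sentences, max_order=3)
total = sum(len(s) for s in sentences)

seen = stupid_backoff("sat", ("the", "cat"), counts, total)      # trigram observed
backed = stupid_backoff("sat", ("a", "cat"), counts, total)      # falls back to bigram
unseen = stupid_backoff("sat", ("big", "bird"), counts, total)   # falls back to unigram
print(seen, backed, unseen)
```

Because scoring needs only raw counts and a constant, training reduces to counting n-grams, which is the kind of computation that distributes naturally across many machines.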

Recently, three Google researchers summarized this data-driven philosophy in an essay titled

The Unreasonable Effectiveness of Data [65].⁶ Why is this so? It boils down to the fact that language

in the wild, just like human behavior in general, is messy. Unlike, say, the interaction of subatomic

particles, human use of language is not constrained by succinct, universal “laws of grammar”. There

are of course rules that govern the formation of words and sentences—for example, that verbs appear

before objects in English, and that subjects and verbs must agree in number in many languages—but

real-world language is affected by a multitude of other factors as well: people invent new words and

phrases all the time, authors occasionally make mistakes, groups of individuals write within a shared

context, etc. The Argentine writer Jorge Luis Borges wrote a famous allegorical one-paragraph story

about a ﬁctional society in which the art of cartography had gotten so advanced that their maps were

as big as the lands they were describing.⁷ The world, he would say, is the best description of itself. In

the same way, the more observations we gather about language use, the more accurate a description

we have of language itself. This, in turn, translates into more effective algorithms and systems.

So, in summary, why large data? In some ways, the ﬁrst answer is similar to the reason people

climb mountains: because they’re there. But the second answer is even more compelling. Data

represent the rising tide that lifts all boats—more data lead to better algorithms and systems for

solving real-world problems. Now that we’ve addressed the why, let’s tackle the how. Let’s start with

the obvious observation: data-intensive processing is beyond the capability of any individual machine

and requires clusters—which means that large-data problems are fundamentally about organizing

computations on dozens, hundreds, or even thousands of machines. This is exactly what MapReduce

does, and the rest of this book is about the how.

⁴ As an aside, it is interesting to observe the evolving definition of large over the years. Banko and Brill’s paper in 2001 was titled Scaling to Very Very Large Corpora for Natural Language Disambiguation, and dealt with a corpus containing a billion words.

⁵ As in, so stupid it couldn’t possibly work.

⁶ This title was inspired by a classic article titled The Unreasonable Effectiveness of Mathematics in the Natural Sciences [155]. This is somewhat ironic in that the original article lauded the beauty and elegance of mathematical models in capturing natural phenomena, which is the exact opposite of the data-driven approach.

⁷ On Exactitude in Science [23]. A similar exchange appears in Chapter XI of Sylvie and Bruno Concluded by Lewis Carroll (1893).


1.1 COMPUTING IN THE CLOUDS

For better or for worse, it is often difﬁcult to untangle MapReduce and large-data processing from

the broader discourse on cloud computing. True, there is substantial promise in this new paradigm

of computing, but unwarranted hype by the media and popular sources threatens its credibility in

the long run. In some ways, cloud computing is simply brilliant marketing. Before clouds, there were

grids,⁸ and before grids, there were vector supercomputers, each having claimed to be the best thing

since sliced bread.

So what exactly is cloud computing? This is one of those questions where 10 experts will give

11 different answers; in fact, countless papers have been written simply to attempt to deﬁne the term

(e.g., [9; 31; 149], just to name a few examples). Here we offer up our own thoughts and attempt to

explain how cloud computing relates to MapReduce and data-intensive processing.

At the most superﬁcial level, everything that used to be called web applications has been

rebranded to become “cloud applications”, which includes what we have previously called “Web

2.0” sites. In fact, anything running inside a browser that gathers and stores user-generated content

now qualiﬁes as an example of cloud computing. This includes social-networking services such

as Facebook, video-sharing sites such as YouTube, web-based email services such as Gmail, and

applications such as Google Docs. In this context, the cloud simply refers to the servers that power

these sites, and user data is said to reside “in the cloud”. The accumulation of vast quantities of user

data creates large-data problems, many of which are suitable for MapReduce. To give two concrete

examples: a social-networking site analyzes connections in the enormous globe-spanning graph of

friendships to recommend new connections. An online email service analyzes messages and user

behavior to optimize ad selection and placement. These are all large-data problems that have been

tackled with MapReduce.⁹

Another important facet of cloud computing is what’s more precisely known as utility com-

puting [31; 129]. As the name implies, the idea behind utility computing is to treat computing

resources as a metered service, like electricity or natural gas. The idea harkens back to the days of

time-sharing machines, and in truth isn’t very different from this antiquated form of computing.

Under this model, a “cloud user” can dynamically provision any amount of computing resources

from a “cloud provider” on demand and only pay for what is consumed. In practical terms, the user

is paying for access to virtual machine instances that run a standard operating system such as Linux.

Virtualization technology (e.g., [15]) is used by the cloud provider to allocate available physical

resources and enforce isolation between multiple users who may be sharing the same hardware.

Once one or more virtual machine instances have been provisioned, the user has full control over the

resources and can use them for arbitrary computation. Virtual machines that are no longer needed

are destroyed, thereby freeing up physical resources that can be redirected to other users. Resource

consumption is measured in some equivalent of machine-hours and users are charged in increments

thereof.

⁸ What is the difference between cloud computing and grid computing? Although both tackle the fundamental problem of how best to bring computational resources to bear on large and difficult problems, they start with different assumptions. Whereas clouds are assumed to be relatively homogeneous servers that reside in a datacenter or are distributed across a relatively small number of datacenters controlled by a single organization, grids are assumed to be a less tightly-coupled federation of heterogeneous resources under the control of distinct but cooperative organizations. As a result, grid computing tends to deal with tasks that are coarser-grained, and must deal with the practicalities of a federated environment, e.g., verifying credentials across multiple administrative domains. Grid computing has adopted a middleware-based approach for tackling many of these challenges.

⁹ The first example is Facebook, a well-known user of Hadoop, in exactly the manner as described [68]. The second is, of course, Google, which uses MapReduce to continuously improve existing algorithms and to devise new algorithms for ad selection and placement.

Both users and providers beneﬁt in the utility computing model. Users are freed from upfront

capital investments necessary to build datacenters and substantial recurring costs in maintaining

them. They also gain the important property of elasticity: as demand for computing resources grows,

for example, from an unpredicted spike in customers, more resources can be seamlessly allocated

from the cloud without an interruption in service. As demand falls, provisioned resources can be

released. Prior to the advent of utility computing, coping with unexpected spikes in demand was

fraught with challenges: under-provision and run the risk of service interruptions, or over-provision

and tie up precious capital in idle machines that are depreciating.

From the utility provider point of view, this business also makes sense because large datacenters

benefit from economies of scale and can be run more efficiently than smaller datacenters. In the same

way that insurance works by aggregating risk and redistributing it, utility providers aggregate the

computing demands for a large number of users. Although demand may ﬂuctuate signiﬁcantly for

each user, overall trends in aggregate demand should be smooth and predictable, which allows the

cloud provider to adjust capacity over time with less risk of either offering too much (resulting in

inefﬁcient use of capital) or too little (resulting in unsatisﬁed customers). In the world of utility

computing, Amazon Web Services currently leads the way and remains the dominant player, but a

number of other cloud providers populate a market that is becoming increasingly crowded. Most

systems are based on proprietary infrastructure, but there is at least one, Eucalyptus [111], that

is available open source. Increased competition will beneﬁt cloud users, but what direct relevance

does this have for MapReduce? The connection is quite simple: processing large amounts of data

with MapReduce requires access to clusters with sufﬁcient capacity. However, not everyone with

large-data problems can afford to purchase and maintain clusters. This is where utility computing

comes in: clusters of sufﬁcient size can be provisioned only when the need arises, and users pay only

as much as is required to solve their problems. This lowers the barrier to entry for data-intensive

processing and makes MapReduce much more accessible.

A generalization of the utility computing concept is “everything as a service”, which is itself a

new take on the age-old idea of outsourcing. A cloud provider offering customers access to virtual

machine instances is said to be offering infrastructure as a service, or IaaS for short. However, this

may be too low level for many users. Enter platform as a service (PaaS), which is a rebranding of what

used to be called hosted services in the “pre-cloud” era. Platform is used generically to refer to any

set of well-deﬁned services on top of which users can build applications, deploy content, etc. This

class of services is best exempliﬁed by Google App Engine, which provides the backend datastore

and API for anyone to build highly scalable web applications. Google maintains the infrastructure,


freeing the user from having to back up, upgrade, patch, or otherwise maintain basic services such

as the storage layer or the programming environment. At an even higher level, cloud providers can

offer software as a service (SaaS), as exempliﬁed by Salesforce, a leader in customer relationship

management (CRM) software. Other examples include outsourcing an entire organization’s email

to a third party, which is commonplace today.

What does this proliferation of services have to do with MapReduce? No doubt that “ev-

erything as a service” is driven by desires for greater business efﬁciencies, but scale and elasticity

play important roles as well. The cloud allows seamless expansion of operations without the need

for careful planning and supports scales that may otherwise be difﬁcult or cost-prohibitive for an

organization to achieve. Cloud services, just like MapReduce, represent the search for an appro-

priate level of abstraction and beneﬁcial divisions of labor. IaaS is an abstraction over raw physical

hardware—an organization might lack the capital, expertise, or interest in running datacenters, and

therefore pays a cloud provider to do so on its behalf. The argument applies similarly to PaaS and

SaaS. In the same vein, the MapReduce programming model is a powerful abstraction that separates

the what from the how of data-intensive processing.

1.2 BIG IDEAS

Tackling large-data problems requires a distinct approach that sometimes runs counter to traditional

models of computing. In this section, we discuss a number of “big ideas” behind MapReduce. To be

fair, all of these ideas have been discussed in the computer science literature for some time (some for

decades), and MapReduce is certainly not the ﬁrst to adopt these ideas. Nevertheless, the engineers

at Google deserve tremendous credit for pulling these various threads together and demonstrating

the power of these ideas on a scale previously unheard of.

Scale “out”, not “up”. For data-intensive workloads, a large number of commodity low-end servers

(i.e., the scaling “out” approach) is preferred over a small number of high-end servers (i.e., the scaling

“up” approach). The latter approach of purchasing symmetric multi-processing (SMP) machines with

a large number of processor sockets (dozens, even hundreds) and a large amount of shared memory

(hundreds or even thousands of gigabytes) is not cost effective, since the costs of such machines do

not scale linearly (i.e., a machine with twice as many processors is often signiﬁcantly more than twice

as expensive). On the other hand, the low-end server market overlaps with the high-volume desktop

computing market, which has the effect of keeping prices low due to competition, interchangeable

components, and economies of scale.

Barroso and Hölzle’s recent treatise on what they dubbed “warehouse-scale computers” [18]

contains a thoughtful analysis of the two approaches. The Transaction Processing Performance Council (TPC)

is a neutral, non-proﬁt organization whose mission is to establish objective database benchmarks.

Benchmark data submitted to that organization are probably the closest one can get to a fair “apples-

to-apples” comparison of cost and performance for speciﬁc, well-deﬁned relational processing appli-

cations. Based on TPC-C benchmark results from late 2007, a low-end server platform is about four


times more cost efficient than a high-end shared memory platform from the same vendor. Excluding

storage costs, the price/performance advantage of the low-end server increases to about a factor of

twelve.

What if we take into account the fact that communication between nodes in a high-end

SMP machine is orders of magnitude faster than communication between nodes in a commodity

network-based cluster? Since workloads today are beyond the capability of any single machine (no

matter how powerful), the comparison is more accurately between a smaller cluster of high-end

machines and a larger cluster of low-end machines (network communication is unavoidable in both

cases). Barroso and Hölzle model these two approaches under workloads that demand more or

less communication, and conclude that a cluster of low-end servers approaches the performance of

the equivalent cluster of high-end servers—the small performance gap is insufﬁcient to justify the

price premium of the high-end servers. For data-intensive applications, the conclusion appears to

be clear: scaling “out” is superior to scaling “up”, and therefore most existing implementations of the

MapReduce programming model are designed around clusters of low-end commodity servers.

Capital costs in acquiring servers are, of course, only one component of the total cost of

delivering computing capacity. Operational costs are dominated by the cost of electricity to power

the servers as well as other aspects of datacenter operations that are functionally related to power:

power distribution, cooling, etc. [18; 67]. As a result, energy efﬁciency has become a key issue in

building warehouse-scale computers for large-data processing. Therefore, it is important to factor in

operational costs when deploying a scale-out solution based on large numbers of commodity servers.

Datacenter efﬁciency is typically factored into three separate components that can be inde-

pendently measured and optimized [18]. The ﬁrst component measures how much of a building’s

incoming power is actually delivered to computing equipment, and correspondingly, how much is

lost to the building’s mechanical systems (e.g., cooling, air handling) and electrical infrastructure

(e.g., power distribution inefﬁciencies). The second component measures how much of a server’s

incoming power is lost to the power supply, cooling fans, etc. The third component captures how

much of the power delivered to computing components (processor, RAM, disk, etc.) is actually used

to perform useful computations.

Of the three components of datacenter efﬁciency, the ﬁrst two are relatively straightforward

to objectively quantify. Adoption of industry best-practices can help datacenter operators achieve

state-of-the-art efﬁciency. The third component, however, is much more difﬁcult to measure. One

important issue that has been identiﬁed is the non-linearity between load and power draw. That is,

a server at 10% utilization may draw slightly more than half as much power as a server at 100%

utilization (which means that a lightly loaded server is much less efﬁcient than a heavily loaded

server). A survey of ﬁve thousand Google servers over a six-month period shows that servers operate

most of the time at between 10% and 50% utilization [17], which is an energy-inefﬁcient operating

region. As a result, Barroso and Hölzle have advocated for research and development in energy-

proportional machines, where energy consumption would be proportional to load, such that an


idle processor would (ideally) consume no power, yet retain the ability to power up (nearly)

instantaneously in response to demand.
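
The non-linearity between load and power draw described above can be made concrete with a small sketch. It assumes a linear power model in which a server draws a fixed idle share of its peak power plus a share proportional to load. The function names and the idle share of 50% are illustrative assumptions, chosen so that a server at 10% utilization draws slightly more than half of its peak power; they are not measurements from the cited studies.

```python
def power_fraction(utilization, idle_fraction=0.5):
    """Fraction of peak power drawn at a given utilization (0.0 to 1.0).

    Assumes a linear model: a fixed idle draw plus a load-dependent part.
    idle_fraction=0.5 is an illustrative assumption, not a measured value.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return idle_fraction + (1.0 - idle_fraction) * utilization


def relative_efficiency(utilization, idle_fraction=0.5):
    """Useful work per unit of power, normalized so that full load is 1.0."""
    return utilization / power_fraction(utilization, idle_fraction)


if __name__ == "__main__":
    for u in (0.1, 0.3, 0.5, 1.0):
        print(f"utilization {u:4.0%}: "
              f"power {power_fraction(u):.2f} of peak, "
              f"efficiency {relative_efficiency(u):.2f}")
```

Under these assumptions, setting idle_fraction to 0.0 models an energy-proportional machine, whose relative efficiency stays at 1.0 for any non-zero load.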

Although we have provided a brief overview here, datacenter efficiency is a topic that is beyond

the scope of this book. For more details, consult Barroso and Hölzle [18] and Hamilton [67], who

provide detailed cost models for typical modern datacenters. However, even factoring in operational

costs, evidence suggests that scaling out remains more attractive than scaling up.

Assume failures are common. At warehouse scale, failures are not only inevitable, but commonplace.

A simple calculation sufﬁces to demonstrate: let us suppose that a cluster is built from reliable

machines with a mean time between failures (MTBF) of 1000 days (about three years). Even with

these reliable servers, a 10,000-server cluster would still experience roughly 10 failures a day. For

the sake of argument, let us suppose that an MTBF of 10,000 days (about 30 years) were achievable

at realistic costs (which is unlikely). Even then, a 10,000-server cluster would still experience one

failure daily. This means that any large-scale service that is distributed across a large cluster (either a

user-facing application or a computing platform like MapReduce) must cope with hardware failures

as an intrinsic aspect of its operation [66]. That is, a server may fail at any time, without notice. For

example, in large clusters disk failures are common [123] and RAM experiences more errors than

one might expect [135]. Datacenters suffer from both planned outages (e.g., system maintenance

and hardware upgrades) and unexpected outages (e.g., power failure, connectivity loss, etc.).
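
The failure arithmetic earlier in this paragraph can be reproduced in a few lines. This is a minimal sketch that assumes servers fail independently at a constant rate of one failure per MTBF. The second function additionally assumes failures follow a Poisson process, which is a modeling assumption added here and not a claim from the text. Both function names are hypothetical.

```python
import math


def expected_failures_per_day(num_servers, mtbf_days):
    """Expected number of server failures per day across a cluster.

    Assumes independent servers that each fail on average once every
    mtbf_days, so the cluster-wide rate is num_servers / mtbf_days.
    """
    if num_servers < 0 or mtbf_days <= 0:
        raise ValueError("num_servers must be >= 0 and mtbf_days > 0")
    return num_servers / mtbf_days


def prob_at_least_one_failure(num_servers, mtbf_days):
    """Chance of one or more failures on a given day (Poisson assumption)."""
    return 1.0 - math.exp(-expected_failures_per_day(num_servers, mtbf_days))


if __name__ == "__main__":
    for mtbf in (1_000, 10_000):
        rate = expected_failures_per_day(10_000, mtbf)
        print(f"MTBF {mtbf} days: about {rate:g} failures per day")
```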

A well-designed, fault-tolerant service must cope with failures up to a point without impacting

the quality of service—failures should not result in inconsistencies or indeterminism from the user

perspective. As servers go down, other cluster nodes should seamlessly step in to handle the load,

and overall performance should gracefully degrade as server failures pile up. Just as important, a

broken server that has been repaired should be able to seamlessly rejoin the service without manual

reconﬁguration by the administrator. Mature implementations of the MapReduce programming

model are able to robustly cope with failures through a number of mechanisms such as automatic

task restarts on different cluster nodes.
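
As a rough illustration of automatic task restarts, the sketch below retries a task on successive nodes until one attempt succeeds. It is not how any particular MapReduce implementation is written. The names NodeFailure and run_with_restarts, and the idea of representing a task as a plain callable, are hypothetical and exist only for this example.

```python
class NodeFailure(Exception):
    """Raised by a (hypothetical) node when it cannot finish a task."""


def run_with_restarts(task, nodes, max_attempts=3):
    """Run task on the given nodes in turn until one attempt succeeds.

    task is a callable that takes a node name; nodes is a list of names.
    Returns (result, node_used). Raises RuntimeError if all attempts fail.
    """
    attempts = 0
    for node in nodes:
        if attempts >= max_attempts:
            break
        attempts += 1
        try:
            return task(node), node
        except NodeFailure:
            continue  # reschedule the same task on the next node
    raise RuntimeError(f"task failed after {attempts} attempts")


if __name__ == "__main__":
    broken = {"node-a"}  # pretend this node has crashed

    def count_words(node):
        if node in broken:
            raise NodeFailure(node)
        return len("the quick brown fox".split())

    print(run_with_restarts(count_words, ["node-a", "node-b", "node-c"]))
```

The caller never sees the failed attempt on the first node, which is the behavior the paragraph above asks of a fault-tolerant service.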

Move processing to the data. In traditional high-performance computing (HPC) applications (e.g.,

for climate or nuclear simulations), it is commonplace for a supercomputer to have “processing nodes”

and “storage nodes” linked together by a high-capacity interconnect. Many data-intensive workloads

are not very processor-demanding, which means that the separation of compute and storage creates

a bottleneck in the network. As an alternative to moving data around, it is more efﬁcient to move

the processing around. That is, MapReduce assumes an architecture where processors and storage

(disk) are co-located. In such a setup, we can take advantage of data locality by running code on the

processor directly attached to the block of data we need. The distributed ﬁle system is responsible

for managing the data over which MapReduce operates.
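
The idea of running code where the data already resides can be sketched as a simple assignment policy. The example assumes that we know which nodes store a replica of each block and that every node can accept any number of tasks. The function assign_tasks and its greedy load balancing are illustrative assumptions, not the scheduling algorithm of an actual framework.

```python
def assign_tasks(block_locations, nodes):
    """Assign one task per block, preferring a node that stores the block.

    block_locations maps a block id to the list of nodes holding a replica.
    nodes is the list of all worker nodes. Load is balanced greedily by
    counting how many tasks each node has received so far.
    Returns a dict mapping block id to (node, is_local).
    """
    load = {node: 0 for node in nodes}
    assignment = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if n in load]
        # Fall back to any node only when no replica holder is available.
        candidates = local if local else nodes
        chosen = min(candidates, key=lambda n: load[n])
        load[chosen] += 1
        assignment[block] = (chosen, bool(local))
    return assignment


if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n9"]}
    for block, (node, is_local) in assign_tasks(blocks, ["n1", "n2", "n3"]).items():
        kind = "local read" if is_local else "remote read"
        print(f"{block} -> {node} ({kind})")
```

Only the block whose replicas sit on an unavailable node needs to be read over the network; the other tasks read from local disk.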

Process data sequentially and avoid random access. Data-intensive processing by deﬁnition means

that the relevant datasets are too large to ﬁt in memory and must be held on disk. Seek times for

random disk access are fundamentally limited by the mechanical nature of the devices: read heads


can only move so fast and platters can only spin so rapidly. As a result, it is desirable to avoid

random data access, and instead organize computations so that data are processed sequentially. A

simple scenario¹⁰ poignantly illustrates the large performance gap between sequential operations

and random seeks: assume a 1 terabyte database containing 10^10 100-byte records. Given reasonable

assumptions about disk latency and throughput, a back-of-the-envelope calculation will show that

updating 1% of the records (by accessing and then mutating each record) will take about a month

on a single machine. On the other hand, if one simply reads the entire database and rewrites all

the records (mutating those that need updating), the process would ﬁnish in under a work day on

a single machine. Sequential data access is, literally, orders of magnitude faster than random data

access.¹¹ The development of solid-state drives is unlikely to change this balance for at least two

reasons. First, the cost differential between traditional magnetic disks and solid-state disks remains

substantial: large-data will for the most part remain on mechanical drives, at least in the near

future. Second, although solid-state disks have substantially faster seek times, order-of-magnitude

differences in performance between sequential and random access still remain.
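
The back-of-the-envelope calculation for the 1 terabyte database can be sketched as follows. The text does not state the disk parameters, so the 10 ms per seek and 100 MB/s of sequential throughput used here are assumptions, as is the simplification that each random update costs one seek to read the record and one to write it back.

```python
def random_update_seconds(num_updates, seek_seconds=0.01):
    """Time to update records one by one: one seek to read, one to write."""
    return num_updates * 2 * seek_seconds


def full_rewrite_seconds(database_bytes, bytes_per_second=100e6):
    """Time to stream the whole database off disk and write it back."""
    return 2 * database_bytes / bytes_per_second


if __name__ == "__main__":
    record_bytes = 100
    num_records = 10**10                  # 10^10 records of 100 bytes = 1 TB
    database_bytes = num_records * record_bytes
    num_updates = num_records // 100      # update 1% of the records

    random_days = random_update_seconds(num_updates) / 86400
    sequential_hours = full_rewrite_seconds(database_bytes) / 3600
    print(f"random updates: {random_days:.1f} days")
    print(f"full rewrite:   {sequential_hours:.1f} hours")
```

With these assumed parameters the random updates take roughly 23 days and the full rewrite under six hours, consistent with "about a month" versus "under a work day".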

MapReduce is primarily designed for batch processing over large datasets. To the extent

possible, all computations are organized into long streaming operations that take advantage of the

aggregate bandwidth of many disks in a cluster. Many aspects of MapReduce’s design explicitly trade

latency for throughput.

Hide system-level details from the application developer. According to many guides on the practice

of software engineering written by experienced industry professionals, one of the key reasons why

writing code is difficult is that the programmer must simultaneously keep track of many details

in short-term memory—ranging from the mundane (e.g., variable names) to the sophisticated (e.g.,

a corner case of an algorithm that requires special treatment). This imposes a high cognitive load

and requires intense concentration, which leads to a number of recommendations about a program-

mer’s environment (e.g., quiet ofﬁce, comfortable furniture, large monitors, etc.). The challenges in

writing distributed software are greatly compounded—the programmer must manage details across

several threads, processes, or machines. Of course, the biggest headache in distributed programming

is that code runs concurrently in unpredictable orders, accessing data in unpredictable patterns.

This gives rise to race conditions, deadlocks, and other well-known problems. Programmers are

taught to use low-level devices such as mutexes and to apply high-level “design patterns” such as

producer–consumer queues to tackle these challenges, but the truth remains: concurrent programs

are notoriously difﬁcult to reason about and even harder to debug.

¹⁰ Adapted from a post by Ted Dunning on the Hadoop mailing list.

¹¹ For more detail, Jacobs [76] provides real-world benchmarks in his discussion of large-data problems.

MapReduce addresses the challenges of distributed programming by providing an abstraction

that isolates the developer from system-level details (e.g., locking of data structures, data starvation

issues in the processing pipeline, etc.). The programming model specifies simple and well-defined

interfaces between a small number of components, and therefore is easy for the programmer to reason

about. MapReduce maintains a separation of what computations are to be performed and how those

computations are actually carried out on a cluster of machines. The ﬁrst is under the control of

the programmer, while the second is exclusively the responsibility of the execution framework or

“runtime”. The advantage is that the execution framework only needs to be designed once and veriﬁed

for correctness—thereafter, as long as the developer expresses computations in the programming

model, code is guaranteed to behave as expected. The upshot is that the developer is freed from

having to worry about system-level details (e.g., no more debugging race conditions and addressing

lock contention) and can instead focus on algorithm or application design.
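
The separation of the what from the how can be illustrated with a toy word count. In this sketch the two small functions at the top are what the developer writes, and run_job is a single-process stand-in for an execution framework. All three names are hypothetical, and the stand-in performs no distribution, scheduling, or fault handling; it only shows that the developer's code contains no system-level details.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """The 'what': emit a (word, 1) pair for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """The 'what': sum all partial counts for one word."""
    yield word, sum(counts)


def run_job(mapper, reducer, records):
    """The 'how': a toy, single-process stand-in for an execution framework.

    It applies the mapper to every record, groups intermediate values by
    key, and applies the reducer to each group. A real framework would
    spread these steps over many machines without changing the mapper
    or the reducer.
    """
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    docs = [(1, "a rose is a rose"), (2, "is it a rose")]
    print(run_job(map_word_count, reduce_word_count, docs))
```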

Seamless scalability. For data-intensive processing, it goes without saying that scalable algorithms

are highly desirable. As an aspiration, let us sketch the behavior of an ideal algorithm. We can deﬁne

scalability along at least two dimensions.¹² First, in terms of data: given twice the amount of data,

the same algorithm should take at most twice as long to run, all else being equal. Second, in terms of

resources: given a cluster twice the size, the same algorithm should take no more than half as long to

run. Furthermore, an ideal algorithm would maintain these desirable scaling characteristics across

a wide range of settings: on data ranging from gigabytes to petabytes, on clusters consisting of a

few to a few thousand machines. Finally, the ideal algorithm would exhibit these desired behaviors

without requiring any modiﬁcations whatsoever, not even tuning of parameters.

Other than for embarrassingly parallel problems, algorithms with the characteristics sketched

above are, of course, unobtainable. One of the fundamental assertions in Fred Brooks’s classic The

Mythical Man-Month [28] is that adding programmers to a project behind schedule will only make it

fall further behind. This is because complex tasks cannot be chopped into smaller pieces and allocated

in a linear fashion, and is often illustrated with a cute quote: “nine women cannot have a baby in

one month”. Although Brooks’s observations are primarily about software engineers and the software

development process, the same is also true of algorithms: increasing the degree of parallelization

also increases communication costs. The algorithm designer is faced with diminishing returns, and

beyond a certain point, greater efﬁciencies gained by parallelization are entirely offset by increased

communication requirements.
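
A toy cost model makes the diminishing returns visible. It assumes that the work divides perfectly among workers while communication overhead grows linearly with their number. This particular model and its constants are illustrative assumptions introduced here; the text does not commit to any specific formula.

```python
def running_time(num_workers, work=100.0, comm_per_worker=0.25):
    """Toy model: divisible work plus overhead that grows with workers.

    Both constants are illustrative assumptions in arbitrary time units.
    """
    if num_workers < 1:
        raise ValueError("need at least one worker")
    return work / num_workers + comm_per_worker * num_workers


def best_worker_count(max_workers, **model):
    """Number of workers (1..max_workers) that minimizes running time."""
    return min(range(1, max_workers + 1),
               key=lambda n: running_time(n, **model))


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 40, 80):
        print(f"{n:3d} workers: {running_time(n):6.2f} time units")
    print("best:", best_worker_count(100))
```

In this model, adding workers helps up to 20 workers and hurts beyond that, because the growing communication term outweighs the shrinking computation term.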

Nevertheless, these fundamental limitations shouldn’t prevent us from at least striving for the

unobtainable. The truth is that most current algorithms are far from the ideal. In the domain of text

processing, for example, most algorithms today assume that data ﬁt in memory on a single machine.

For the most part, this is a fair assumption. But what happens when the amount of data doubles in the

near future, and then doubles again shortly thereafter? Simply buying more memory is not a viable

solution, as the amount of data is growing faster than the price of memory is falling. Furthermore,

the price of a machine does not scale linearly with the amount of available memory beyond a certain

point (once again, the scaling “up” vs. scaling “out” argument). Quite simply, algorithms that require

holding intermediate data in memory on a single machine will break on sufﬁciently large datasets—

moving from a single machine to a cluster architecture requires fundamentally different algorithms

(and reimplementations).

¹² See also DeWitt and Gray [50] for slightly different definitions in terms of speedup and scaleup.


Perhaps the most exciting aspect of MapReduce is that it represents a small step toward

algorithms that behave in the ideal manner discussed above. Recall that the programming model

maintains a clear separation between what computations need to occur and how those computations

are actually orchestrated on a cluster. As a result, a MapReduce algorithm remains ﬁxed, and it is

the responsibility of the execution framework to execute the algorithm. Amazingly, the MapReduce

programming model is simple enough that it is actually possible, in many circumstances, to approach

the ideal scaling characteristics discussed above. We introduce the idea of the “tradeable machine

hour”, as a play on Brooks’s classic title. If running an algorithm on a particular dataset takes 100

machine hours, then we should be able to ﬁnish in an hour on a cluster of 100 machines, or use a

cluster of 10 machines to complete the same task in ten hours.¹³ With MapReduce, this isn’t so far

from the truth, at least for some applications.
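
The "tradeable machine hour" amounts to simple arithmetic under ideal scaling, sketched below. The price per machine-hour is a made-up figure used only to show that, when billing is per machine-hour, the total charge does not depend on how the machine hours are split between cluster size and running time.

```python
def wall_clock_hours(machine_hours, num_machines):
    """Elapsed time under ideal scaling: work divides evenly over machines."""
    if num_machines < 1:
        raise ValueError("need at least one machine")
    return machine_hours / num_machines


def rental_cost(machine_hours, num_machines, price_per_machine_hour=0.10):
    """Charge for renting num_machines for the whole run.

    price_per_machine_hour is a hypothetical rate, not a quoted price.
    """
    hours = wall_clock_hours(machine_hours, num_machines)
    return num_machines * hours * price_per_machine_hour


if __name__ == "__main__":
    for n in (10, 100):
        print(f"{n:3d} machines: {wall_clock_hours(100, n):g} hours, "
              f"cost {rental_cost(100, n):.2f}")
```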

1.3 WHY IS THIS DIFFERENT?

“Due to the rapidly decreasing cost of processing, memory, and communication, it has

appeared inevitable for at least two decades that parallel machines will eventually displace

sequential ones in computationally intensive domains. This, however, has not happened.”

— Leslie Valiant [148]¹⁴

For several decades, computer scientists have predicted that the dawn of the age of parallel

computing was “right around the corner” and that sequential processing would soon fade into obso-

lescence (consider, for example, the above quote). Yet, until very recently, they have been wrong. The

relentless progress of Moore’s Law for several decades has ensured that most of the world’s prob-

lems could be solved by single-processor machines, save the needs of a few (scientists simulating

molecular interactions or nuclear reactions, for example). Couple that with the inherent challenges

of concurrency, and the result has been that parallel processing and distributed systems have largely

been conﬁned to a small segment of the market and esoteric upper-level electives in the computer

science curriculum.

However, all of that changed around the middle of the ﬁrst decade of this century. The

manner in which the semiconductor industry had been exploiting Moore’s Law simply ran out

of opportunities for improvement: faster clocks, deeper pipelines, superscalar architectures, and

other tricks of the trade reached a point of diminishing returns that did not justify continued

investment. This marked the beginning of an entirely new strategy and the dawn of the multi-core

era [115]. Unfortunately, this radical shift in hardware architecture was not matched at that time by

corresponding advances in how software could be easily designed for these new processors (but not

for lack of trying [104]). Nevertheless, parallel processing became an important issue at the forefront

of everyone’s mind—it represented the only way forward.

¹³ Note that this idea meshes well with utility computing, where a 100-machine cluster running for one hour would cost the same as a 10-machine cluster running for ten hours.

¹⁴ Guess when this was written? You may be surprised.


At around the same time, we witnessed the growth of large-data problems. In the late 1990s

and even during the beginning of the first decade of this century, relatively few organizations had

data-intensive processing needs that required large clusters: a handful of internet companies and perhaps

a few dozen large corporations. But then, everything changed. Through a combination of many

different factors (falling prices of disks, rise of user-generated web content, etc.), large-data problems

began popping up everywhere. Data-intensive processing needs became widespread, which drove

innovations in distributed computing such as MapReduce—ﬁrst by Google, and then by Yahoo and

the open source community. This in turn created more demand: when organizations learned about

the availability of effective data analysis tools for large datasets, they began instrumenting various

business processes to gather even more data—driven by the belief that more data lead to deeper

insights and greater competitive advantages. Today, not only are large-data problems ubiquitous,

but technological solutions for addressing them are widely accessible. Anyone can download the

open-source Hadoop implementation of MapReduce, pay a modest fee to rent a cluster from a

utility cloud provider, and be happily processing terabytes upon terabytes of data within the week.

Finally, the computer scientists are right—the age of parallel computing has begun, both in terms of

multiple cores in a chip and multiple machines in a cluster (each of which often has multiple cores).

Why is MapReduce important? In practical terms, it provides a very effective tool for tackling

large-data problems. But beyond that, MapReduce is important in how it has changed the way

we organize computations at a massive scale. MapReduce represents the ﬁrst widely adopted step

away from the von Neumann model that has served as the foundation of computer science over

the last half-century or more. Valiant called this a bridging model [148], a conceptual bridge between

the physical implementation of a machine and the software that is to be executed on that machine.

Until recently, the von Neumann model has served us well: Hardware designers focused on efﬁcient

implementations of the von Neumann model and didn’t have to think much about the actual soft-

ware that would run on the machines. Similarly, the software industry developed software targeted

at the model without worrying about the hardware details. The result was extraordinary growth:

chip designers churned out successive generations of increasingly powerful processors, and software

engineers were able to develop applications in high-level languages that exploited those processors.

Today, however, the von Neumann model isn’t sufﬁcient anymore: we can’t treat a multi-core

processor or a large cluster as an agglomeration of many von Neumann machine instances commu-

nicating over some interconnect. Such a view places too much burden on the software developer

to effectively take advantage of available computational resources—it simply is the wrong level of

abstraction. MapReduce can be viewed as the ﬁrst breakthrough in the quest for new abstractions

that allow us to organize computations, not over individual machines, but over entire clusters. As

Barroso puts it, the datacenter is the computer [18; 119].

To be fair, MapReduce is certainly not the ﬁrst model of parallel computation that has been

proposed. The most prevalent model in theoretical computer science, which dates back several

decades, is the PRAM [60; 77].¹⁵ In the model, an arbitrary number of processors, sharing an

unboundedly large memory, operate synchronously on a shared input to produce some output.

Other models include LogP [43] and BSP [148]. For reasons that are beyond the scope of this book,

none of these previous models have enjoyed the success that MapReduce has in terms of adoption

and in terms of impact on the daily lives of millions of users.¹⁶

¹⁵ More than a theoretical model, the PRAM has been recently prototyped in hardware [153].

MapReduce is the most successful abstraction over large-scale computational resources we

have seen to date. However, as anyone who has taken an introductory computer science course

knows, abstractions manage complexity by hiding details and presenting well-deﬁned behaviors to

users of those abstractions. They, inevitably, are imperfect—making certain tasks easier but others

more difﬁcult, and sometimes, impossible (in the case where the detail suppressed by the abstraction

is exactly what the user cares about). This critique applies to MapReduce: it makes certain large-data

problems easier, but suffers from limitations as well. This means that MapReduce is not the ﬁnal

word, but rather the ﬁrst in a new class of programming models that will allow us to more effectively

organize computations at a massive scale.

So if MapReduce is only the beginning, what’s next beyond MapReduce? We’re getting ahead

of ourselves, as we can’t meaningfully answer this question before thoroughly understanding what

MapReduce can and cannot do well. This is exactly the purpose of this book: let us now begin our

exploration.

1.4 WHAT THIS BOOK IS NOT

Actually, not quite yet… A final word before we get started. This book is about MapReduce algorithm

design, particularly for text processing (and related) applications. Although our presentation most

closely follows the Hadoop open-source implementation of MapReduce, this book is explicitly not

about Hadoop programming. We don’t, for example, discuss APIs, command-line invocations for

running jobs, etc. For those aspects, we refer the reader to Tom White’s excellent book, “Hadoop:

The Deﬁnitive Guide” [154].

¹⁶ Nevertheless, it is important to understand the relationship between MapReduce and existing models so that we can bring to

bear accumulated knowledge about parallel algorithms; for example, Karloff et al. [82] demonstrated that a large class of PRAM

algorithms can be efficiently simulated via MapReduce.

CHAPTER 2

MapReduce Basics

The only feasible approach to tackling large-data problems today is to divide and conquer, a funda-

mental concept in computer science that is introduced very early in typical undergraduate curricula.

The basic idea is to partition a large problem into smaller sub-problems. To the extent that the

sub-problems are independent [5], they can be tackled in parallel by different workers—threads in a

processor core, cores in a multi-core processor, multiple processors in a machine, or many machines

in a cluster. Intermediate results from each individual worker are then combined to yield the final

output.¹

The general principles behind divide-and-conquer algorithms are broadly applicable to a wide

range of problems in many different application domains. However, the details of their implemen-

tations are varied and complex. For example, the following are just some of the issues that need to

be addressed:

• How do we break up a large problem into smaller tasks? More speciﬁcally, how do we decom-

pose the problem so that the smaller tasks can be executed in parallel?

• How do we assign tasks to workers distributed across a potentially large number of machines

(while keeping in mind that some workers are better suited to running some tasks than others,

e.g., due to available resources, locality constraints, etc.)?

• How do we ensure that the workers get the data they need?

• How do we coordinate synchronization among the different workers?

• How do we share partial results from one worker that are needed by another?

• How do we accomplish all of the above in the face of software errors and hardware faults?

In traditional parallel or distributed programming environments, the developer needs to ex-

plicitly address many (and sometimes, all) of the above issues. In shared memory programming, the

developer needs to explicitly coordinate access to shared data structures through synchronization

primitives such as mutexes, to explicitly handle process synchronization through devices such as

barriers, and to remain ever vigilant for common problems such as deadlocks and race conditions.

Language extensions, like OpenMP for shared memory parallelism,² or libraries implementing the

Message Passing Interface (MPI) for cluster-level parallelism,³ provide logical abstractions that hide

details of operating system synchronization and communications primitives. However, even with

these extensions, developers are still burdened to keep track of how resources are made available to

workers. Additionally, these frameworks are mostly designed to tackle processor-intensive problems

and have only rudimentary support for dealing with very large amounts of input data. When using

existing parallel computing approaches for large-data computation, the programmer must devote a

significant amount of attention to low-level system details, which detracts from higher-level problem

solving.

¹ We note that promising technologies such as quantum or biological computing could potentially induce a paradigm shift, but

they are far from being sufficiently mature to solve real-world problems.

² http://www.openmp.org/

One of the most signiﬁcant advantages of MapReduce is that it provides an abstraction that

hides many system-level details from the programmer. Therefore, a developer can focus on what

computations need to be performed, as opposed to how those computations are actually carried out

or how to get the data to the processes that depend on them. Like OpenMP and MPI, MapReduce

provides a means to distribute computation without burdening the programmer with the details of

distributed computing (but at a different level of granularity). However, organizing and coordinating

large amounts of computation is only part of the challenge. Large-data processing by deﬁnition

requires bringing data and code together for computation to occur—no small feat for datasets that

are terabytes and perhaps petabytes in size! MapReduce addresses this challenge by providing a

simple abstraction for the developer, transparently handling most of the details behind the scenes

in a scalable, robust, and efﬁcient manner. As we mentioned in Chapter 1, instead of moving large

amounts of data around, it is far more efﬁcient, if possible, to move the code to the data. This is

operationally realized by spreading data across the local disks of nodes in a cluster and running

processes on nodes that hold the data. The complex task of managing storage in such a processing

environment is typically handled by a distributed ﬁle system that sits underneath MapReduce.

This chapter introduces the MapReduce programming model and the underlying distributed

ﬁle system. We start in Section 2.1 with an overview of functional programming, from which

MapReduce draws its inspiration. Section 2.2 introduces the basic programming model, focusing

on mappers and reducers. Section 2.3 discusses the role of the execution framework in actually

running MapReduce programs (called jobs). Section 2.4 ﬁlls in additional details by introducing

partitioners and combiners, which provide greater control over data ﬂow. MapReduce would not be

practical without a tightly-integrated distributed ﬁle system that manages the data being processed;

Section 2.5 covers this in detail. Tying everything together, a complete cluster architecture is described

in Section 2.6 before the chapter ends with a summary.

2.1 FUNCTIONAL PROGRAMMING ROOTS

MapReduce has its roots in functional programming, which is exemplified in languages such as Lisp

and ML.⁴ A key feature of functional languages is the concept of higher-order functions, or functions

that can accept other functions as arguments. Two common built-in higher-order functions are map

and fold, illustrated in Figure 2.1. Given a list, map takes as an argument a function f (that takes a

single argument) and applies it to all elements in a list (the top part of the diagram). Given a list,

fold takes as arguments a function g (that takes two arguments) and an initial value: g is first applied

to the initial value and the first item in the list, the result of which is stored in an intermediate

variable. This intermediate variable and the next item in the list serve as the arguments to a second

application of g, the results of which are stored in the intermediate variable. This process repeats

until all items in the list have been consumed; fold then returns the final value of the intermediate

variable. Typically, map and fold are used in combination. For example, to compute the sum of squares

of a list of integers, one could map a function that squares its argument (i.e., λx.x²) over the input

list, and then fold the resulting list with the addition function (more precisely, λxλy.x + y) using

an initial value of zero.

Figure 2.1: Illustration of map and fold, two higher-order functions commonly used together in functional

programming: map takes a function f and applies it to every element in a list, while fold iteratively applies

a function g to aggregate results.

³ http://www.mcs.anl.gov/mpi/

⁴ However, there are important characteristics of MapReduce that make it non-functional in nature; this will become apparent

later.
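The sum-of-squares computation described above can be written down directly. The following is a minimal sketch in Python, using the built-in map and functools.reduce as stand-ins for map and fold; the function names square, add, and sum_of_squares are ours and serve only as illustration.

```python
from functools import reduce


def square(x):
    """The function f handed to map: squares its single argument."""
    return x * x


def add(x, y):
    """The function g handed to fold: combines the running value with the next item."""
    return x + y


def sum_of_squares(values):
    # map applies f to every element independently of all the others.
    squared = map(square, values)
    # fold (reduce in Python) threads an intermediate value through the list,
    # starting from the initial value of zero.
    return reduce(add, squared, 0)


print(sum_of_squares([1, 2, 3, 4]))  # prints 30
```

Because each call to square depends only on its own argument, the map step is the part that parallelizes trivially; the fold step is where the results must be brought together.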

We can view map as a concise way to represent the transformation of a dataset (as deﬁned

by the function f ). In the same vein, we can view fold as an aggregation operation, as deﬁned by

the function g. One immediate observation is that the application of f to each item in a list (or

more generally, to elements in a large dataset) can be parallelized in a straightforward manner, since

each functional application happens in isolation. In a cluster, these operations can be distributed

across many different machines. The fold operation, on the other hand, has more restrictions on

data locality—elements in the list must be “brought together” before the function g can be applied.

However, many real-world applications do not require g to be applied to all elements of the list. To

the extent that elements in the list can be divided into groups, the fold aggregations can also proceed


in parallel. Furthermore, for operations that are commutative and associative, signiﬁcant efﬁciencies

can be gained in the fold operation through local aggregation and appropriate reordering.
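To make the point about grouping concrete, here is a small sketch (in Python, with hypothetical helper names chunked and fold_in_groups) of a fold that is first applied to each group separately and then applied once more to the partial results. It assumes that g is associative and that the initial value is an identity element for g, as zero is for addition; without those assumptions the grouped result is not guaranteed to match the sequential one.

```python
from functools import reduce
from operator import add


def chunked(values, size):
    """Split a list into consecutive groups of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def fold_in_groups(g, initial, values, size):
    # Each group is folded on its own; on a cluster these folds could run
    # on different machines at the same time (local aggregation).
    partials = [reduce(g, group, initial) for group in chunked(values, size)]
    # A final fold combines the partial results. This equals the sequential
    # fold only if g is associative and `initial` is an identity for g.
    return reduce(g, partials, initial)


data = list(range(1, 11))
print(fold_in_groups(add, 0, data, 3))  # prints 55
print(reduce(add, data, 0))             # prints 55
```

Replacing add with subtraction, which is neither associative nor commutative, makes the two results disagree, which is why the efficiencies mentioned above are tied to those properties.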

In a nutshell, we have described MapReduce. The map phase in MapReduce roughly corre-

sponds to the map operation in functional programming, whereas the reduce phase in MapReduce

roughly corresponds to the fold operation in functional programming. As we will discuss in detail

shortly, the MapReduce execution framework coordinates the map and reduce phases of processing

over large amounts of data on large clusters of commodity machines.

Viewed from a slightly different angle, MapReduce codiﬁes a generic “recipe” for processing

large datasets that consists of two stages. In the ﬁrst stage, a user-speciﬁed computation is applied

over all input records in a dataset. These operations occur in parallel and yield intermediate output

that is then aggregated by another user-speciﬁed computation. The programmer deﬁnes these two

types of computations, and the execution framework coordinates the actual processing (very loosely,

MapReduce provides a functional abstraction). Although such a two-stage processing structure may

appear to be very restrictive, many interesting algorithms can be expressed quite concisely—especially

if one decomposes complex algorithms into a sequence of MapReduce jobs. Subsequent chapters in

this book focus on how a number of algorithms can be implemented in MapReduce.

To be precise, MapReduce can refer to three distinct but related concepts. First, MapReduce

is a programming model, which is the sense discussed above. Second, MapReduce can refer to the

execution framework (i.e., the “runtime”) that coordinates the execution of programs written in this

particular style. Finally, MapReduce can refer to the software implementation of the programming

model and the execution framework: for example, Google’s proprietary implementation vs. the

open-source Hadoop implementation in Java. And, in fact, there are many implementations of

MapReduce, e.g., targeted speciﬁcally for multi-core processors [127], for GPGPUs [71], for the

CELL architecture [126], etc. There are some differences between the MapReduce programming

model implemented in Hadoop and Google’s proprietary implementation, which we will explicitly

discuss throughout the book. However, we take a rather Hadoop-centric view of MapReduce, since

Hadoop remains the most mature and accessible implementation to date, and therefore the one most

developers are likely to use.

2.2 MAPPERS AND REDUCERS

Key-value pairs form the basic data structure in MapReduce. Keys and values may be primitives such

as integers, floating point values, strings, and raw bytes, or they may be arbitrarily complex structures

(lists, tuples, associative arrays, etc.). Programmers typically need to define their own custom data

types, although a number of libraries such as Protocol Buffers,⁵ Thrift,⁶ and Avro⁷ simplify the task.

Part of the design of MapReduce algorithms involves imposing the key-value structure on

arbitrary datasets. For a collection of web pages, keys may be URLs and values may be the actual

HTML content. For a graph, keys may represent node ids and values may contain the adjacency lists

of those nodes (see Chapter 5 for more details). In some algorithms, input keys are not particularly

meaningful and are simply ignored during processing, while in other cases input keys are used to

uniquely identify a datum (such as a record id). In Chapter 3, we discuss the role of complex keys

and values in the design of various algorithms.

⁵ http://code.google.com/p/protobuf/

⁶ http://incubator.apache.org/thrift/

⁷ http://hadoop.apache.org/avro/

In MapReduce, the programmer deﬁnes a mapper and a reducer with the following signatures:

map: (k₁, v₁) → [(k₂, v₂)]

reduce: (k₂, [v₂]) → [(k₃, v₃)]

The convention [. . .] is used throughout this book to denote a list. The input to a MapReduce

job starts as data stored on the underlying distributed ﬁle system (see Section 2.5). The mapper

is applied to every input key-value pair (split across an arbitrary number of ﬁles) to generate an

arbitrary number of intermediate key-value pairs. The reducer is applied to all values associated with

the same intermediate key to generate output key-value pairs.⁸ Implicit between the map and reduce

phases is a distributed “group by” operation on intermediate keys. Intermediate data arrive at each

reducer in order, sorted by the key. However, no ordering relationship is guaranteed for keys across

different reducers. Output key-value pairs from each reducer are written persistently back onto the

distributed ﬁle system (whereas intermediate key-value pairs are transient and not preserved). The

output ends up in r ﬁles on the distributed ﬁle system, where r is the number of reducers. For the

most part, there is no need to consolidate reducer output, since the r ﬁles often serve as input to yet

another MapReduce job. Figure 2.2 illustrates this two-stage processing structure.

A simple word count algorithm in MapReduce is shown in Figure 2.3. This algorithm counts

the number of occurrences of every word in a text collection, which may be the ﬁrst step in, for

example, building a unigram language model (i.e., probability distribution over words in a collection).

Input key-value pairs take the form of (docid, doc) pairs stored on the distributed file system, where

the former is a unique identiﬁer for the document, and the latter is the text of the document itself.

The mapper takes an input key-value pair, tokenizes the document, and emits an intermediate key-

value pair for every word: the word itself serves as the key, and the integer one serves as the value

(denoting that we’ve seen the word once). The MapReduce execution framework guarantees that

all values associated with the same key are brought together in the reducer. Therefore, in our word

count algorithm, we simply need to sum up all counts (ones) associated with each word. The reducer

does exactly this, and emits ﬁnal key-value pairs with the word as the key, and the count as the value.

Final output is written to the distributed ﬁle system, one ﬁle per reducer. Words within each ﬁle will

be sorted by alphabetical order, and each ﬁle will contain roughly the same number of words. The

partitioner, which we discuss later in Section 2.4, controls the assignment of words to reducers. The

output can be examined by the programmer or used as input to another MapReduce program.
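The word count algorithm just described (and given as pseudo-code in Figure 2.3) can be tried out on a single machine. The sketch below is in Python and is not Hadoop code: run_job is a hypothetical in-memory stand-in for the execution framework that groups intermediate values by key and presents keys to the reducer in sorted order, and whitespace splitting is assumed as the tokenizer.

```python
from collections import defaultdict


def map_word_count(docid, doc):
    """Mapper: emit (term, 1) for every token in the document."""
    for term in doc.split():  # assumption: whitespace tokenization
        yield term, 1


def reduce_word_count(term, counts):
    """Reducer: sum all counts associated with one term."""
    yield term, sum(counts)


def run_job(records, mapper, reducer):
    """Toy single-process stand-in for the execution framework."""
    groups = defaultdict(list)
    # Map phase: apply the mapper to every input key-value pair.
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)  # the "group by" on intermediate keys
    # Reduce phase: keys are presented to the reducer in sorted order.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


docs = [("d1", "a b a"), ("d2", "b c")]
print(run_job(docs, map_word_count, reduce_word_count))
# prints [('a', 2), ('b', 2), ('c', 1)]
```

In this sketch there is only one "reducer", so the output is a single sorted list; with r reducers the same logic would yield r separately sorted outputs.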

⁸ This characterization, while conceptually accurate, is a slight simplification. See Section 2.6 for more details.

[Figure 2.2 diagram: input key-value pairs → four mappers → shuffle and sort (aggregate values by keys) → three reducers → output key-value pairs]

Figure 2.2: Simplified view of MapReduce. Mappers are applied to all input key-value pairs, which

generate an arbitrary number of intermediate key-value pairs. Reducers are applied to all values associated

with the same key. Between the map and reduce phases lies a barrier that involves a large distributed sort

and group by.

class Mapper
    method Map(docid a, doc d)
        for all term t ∈ doc d do
            Emit(term t, count 1)

class Reducer
    method Reduce(term t, counts [c₁, c₂, . . .])
        sum ← 0
        for all count c ∈ counts [c₁, c₂, . . .] do
            sum ← sum + c
        Emit(term t, count sum)

Figure 2.3: Pseudo-code for the word count algorithm in MapReduce. The mapper emits an intermediate

key-value pair for each word in a document. The reducer sums up all counts for each word.

There are some differences between the Hadoop implementation of MapReduce and Google’s

implementation.⁹ In Hadoop, the reducer is presented with a key and an iterator over all values

associated with the particular key. The values are arbitrarily ordered. Google’s implementation allows

the programmer to specify a secondary sort key for ordering the values (if desired), in which case

values associated with each key would be presented to the developer’s reduce code in sorted order.

Later in Section 3.4 we discuss how to overcome this limitation in Hadoop to perform secondary

sorting. Another difference: in Google’s implementation the programmer is not allowed to change

the key in the reducer. That is, the reducer output key must be exactly the same as the reducer input

key. In Hadoop, there is no such restriction, and the reducer can emit an arbitrary number of output

key-value pairs (with different keys).

⁹ Personal communication, Jeff Dean.

To provide a bit more implementation detail: pseudo-code provided in this book roughly

mirrors how MapReduce programs are written in Hadoop. Mappers and reducers are objects that

implement the Map and Reduce methods, respectively. In Hadoop, a mapper object is initialized

for each map task (associated with a particular sequence of key-value pairs called an input split)

and the Map method is called on each key-value pair by the execution framework. In conﬁguring

a MapReduce job, the programmer provides a hint on the number of map tasks to run, but the

execution framework (see next section) makes the ﬁnal determination based on the physical layout

of the data (more details in Section 2.5 and Section 2.6). The situation is similar for the reduce

phase: a reducer object is initialized for each reduce task, and the Reduce method is called once per

intermediate key. In contrast with the number of map tasks, the programmer can precisely specify the

number of reduce tasks. We will return to discuss the details of Hadoop job execution in Section 2.6,

which is dependent on an understanding of the distributed ﬁle system (covered in Section 2.5). To

reiterate: although the presentation of algorithms in this book closely mirrors the way they would

be implemented in Hadoop, our focus is on algorithm design and conceptual understanding—not

actual Hadoop programming. For that, we recommend Tom White’s book [154].

What are the restrictions on mappers and reducers? Mappers and reducers can express arbi-

trary computations over their inputs. However, one must generally be careful about use of external

resources since multiple mappers or reducers may be contending for those resources. For example,

it may be unwise for a mapper to query an external SQL database, since that would introduce a

scalability bottleneck on the number of map tasks that could be run in parallel (since they might

all be simultaneously querying the database).¹⁰ In general, mappers can emit an arbitrary number

of intermediate key-value pairs, and they need not be of the same type as the input key-value pairs.

Similarly, reducers can emit an arbitrary number of final key-value pairs, and they can differ in type

from the intermediate key-value pairs. Although not permitted in functional programming, mappers

and reducers can have side effects. This is a powerful and useful feature: for example, preserving state

across multiple inputs is central to the design of many MapReduce algorithms (see Chapter 3). Such

algorithms can be understood as having side effects that only change state that is internal to the

mapper or reducer. While the correctness of such algorithms may be more difﬁcult to guarantee

(since the function’s behavior depends not only on the current input but on previous inputs), most

potential synchronization problems are avoided since internal state is private only to individual mappers

and reducers. In other cases (see Sections 4.4 and 6.5), it may be useful for mappers or reducers

to have external side effects, such as writing files to the distributed file system. Since many mappers

and reducers are run in parallel, and the distributed file system is a shared global resource, special

care must be taken to ensure that such operations avoid synchronization conflicts. One strategy is

to write a temporary file that is renamed upon successful completion of the mapper or reducer [45].

¹⁰ Unless, of course, the database itself is highly scalable.
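The write-then-rename strategy mentioned above can be illustrated on an ordinary local file system. The sketch below is in Python and uses only the standard library; it is an analogy rather than code against a distributed file system, and the function name write_output_atomically and the ".tmp-" prefix are our own choices.

```python
import os
import tempfile


def write_output_atomically(final_path, lines):
    """Write to a temporary file, then rename it into place on success."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # The temporary file lives in the same directory as the final file so
    # that the rename does not have to cross file systems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as handle:
            for line in lines:
                handle.write(line + "\n")
        # Readers never observe a half-written final file: the name only
        # appears once the task has finished writing.
        os.replace(tmp_path, final_path)
    except BaseException:
        # A failed task leaves no partial output behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "part-00000")
    write_output_atomically(target, ["a\t2", "b\t2", "c\t1"])
    print(os.listdir(workdir))  # prints ['part-00000']
```

The design choice here is that a task that crashes, or a duplicate copy of a task that loses a race, never leaves a partially written file under the final name.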

In addition to the “canonical” MapReduce processing ﬂow, other variations are also possible.

MapReduce programs can contain no reducers, in which case mapper output is directly written to

disk (one ﬁle per mapper). For embarrassingly parallel problems, e.g., parse a large text collection or

independently analyze a large number of images, this would be a common pattern. The converse—a

MapReduce program with no mappers—is not possible, although in some cases it is useful for the

mapper to implement the identity function and simply pass input key-value pairs to the reducers.

This has the effect of sorting and regrouping the input for reduce-side processing. Similarly, in some

cases it is useful for the reducer to implement the identity function, in which case the program simply

sorts and groups mapper output. Finally, running identity mappers and reducers has the effect of

regrouping and resorting the input data (which is sometimes useful).

Although in the most common case, input to a MapReduce job comes from data stored on

the distributed ﬁle system and output is written back to the distributed ﬁle system, any other system

that satisﬁes the proper abstractions can serve as a data source or sink. With Google’s MapReduce

implementation, Bigtable [34], a sparse, distributed, persistent multidimensional sorted map, is

frequently used as a source of input and as a store of MapReduce output. HBase is an open-source

Bigtable clone and has similar capabilities. Also, Hadoop has been integrated with existing MPP

(massively parallel processing) relational databases, which allows a programmer to write MapReduce

jobs over database rows and dump output into a new database table. Finally, in some cases MapReduce

jobs may not consume any input at all (e.g., computing π) or may only consume a small amount of

data (e.g., input parameters to many instances of processor-intensive simulations running in parallel).

2.3 THE EXECUTION FRAMEWORK

One of the most important ideas behind MapReduce is separating the what of distributed processing

from the how. A MapReduce program, referred to as a job, consists of code for mappers and reducers

(as well as combiners and partitioners to be discussed in the next section) packaged together with

conﬁguration parameters (such as where the input lies and where the output should be stored).

The developer submits the job to the submission node of a cluster (in Hadoop, this is called the

jobtracker) and the execution framework (sometimes called the “runtime”) takes care of everything else:

it transparently handles all other aspects of distributed code execution, on clusters ranging from a

single node to a few thousand nodes. Speciﬁc responsibilities include:

Scheduling. Each MapReduce job is divided into smaller units called tasks (see Section 2.6 for

more details). For example, a map task may be responsible for processing a certain block of input

key-value pairs (called an input split in Hadoop); similarly, a reduce task may handle a portion of the


intermediate key space. It is not uncommon for MapReduce jobs to have thousands of individual

tasks that need to be assigned to nodes in the cluster. In large jobs, the total number of tasks may

exceed the number of tasks that can be run on the cluster concurrently, making it necessary for

the scheduler to maintain some sort of a task queue and to track the progress of running tasks so

that waiting tasks can be assigned to nodes as they become available. Another aspect of scheduling

involves coordination among tasks belonging to different jobs (e.g., from different users). How can a

large, shared resource support several users simultaneously in a predictable, transparent, policy-driven

fashion? There has been some recent work along these lines in the context of Hadoop [131; 160].

Speculative execution is an optimization that is implemented by both Hadoop and Google’s

MapReduce implementation (called “backup tasks” [45]). Due to the barrier between the map and

reduce tasks, the map phase of a job is only as fast as the slowest map task. Similarly, the completion

time of a job is bounded by the running time of the slowest reduce task. As a result, the speed of

a MapReduce job is sensitive to what are known as stragglers, or tasks that take an unusually long

time to complete. One cause of stragglers is ﬂaky hardware: for example, a machine that is suffering

from recoverable errors may become signiﬁcantly slower. With speculative execution, an identical

copy of the same task is executed on a different machine, and the framework simply uses the result

of the ﬁrst task attempt to ﬁnish. Zaharia et al. [161] presented different execution strategies in

a recent paper, and Google reported that speculative execution can improve job running times by

44% [45]. Although in Hadoop both map and reduce tasks can be speculatively executed, the common

wisdom is that the technique is more helpful for map tasks than reduce tasks, since each copy of the

reduce task needs to pull data over the network. Note, however, that speculative execution cannot

adequately address another common cause of stragglers: skew in the distribution of values associated

with intermediate keys (leading to reduce stragglers). In text processing we often observe Zipﬁan

distributions, which means that the task or tasks responsible for processing the most frequent few

elements will run much longer than the typical task. Better local aggregation, discussed in the next

chapter, is one possible solution to this problem.
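The core idea of speculative execution, running identical copies of a task and keeping whichever attempt finishes first, can be sketched with threads on one machine. This is an illustration in Python only, not how Hadoop or Google's implementation schedules backup tasks; run_speculatively and the artificial delays that simulate a slow machine are hypothetical.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_speculatively(task, argument, delays):
    """Run one copy of `task` per entry in `delays`; return the first result."""

    def attempt(delay):
        time.sleep(delay)  # simulated slowness of the machine running this copy
        return task(argument)

    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(attempt, delay) for delay in delays]
        # Block only until the fastest attempt is done.
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        result = next(iter(done)).result()
    # Note: leaving the `with` block still waits for the slower copies to
    # end; a real framework would kill the losing attempts instead.
    return result


# The first copy is a straggler, the second runs at normal speed.
print(run_speculatively(lambda x: x * x, 7, [0.2, 0.0]))  # prints 49
```

Since all copies compute the same deterministic function on the same input, it does not matter which attempt wins; this is also why the technique does nothing for skew, where every copy of the task is equally slow.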

Data/code co-location. The phrase data distribution is misleading, since one of the key ideas behind

MapReduce is to move the code, not the data. However, the more general point remains—in order

for computation to occur, we need to somehow feed data to the code. In MapReduce, this issue is

inextricably intertwined with scheduling and relies heavily on the design of the underlying distributed

file system.¹¹ To achieve data locality, the scheduler starts tasks on the node that holds a particular

block of data (i.e., on its local drive) needed by the task. This has the effect of moving code to the

data. If this is not possible (e.g., a node is already running too many tasks), new tasks will be started

elsewhere, and the necessary data will be streamed over the network. An important optimization

here is to prefer nodes that are on the same rack in the datacenter as the node holding the relevant

data block, since inter-rack bandwidth is signiﬁcantly less than intra-rack bandwidth.
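The placement preference described above (same node first, then same rack, then anywhere) can be expressed as a short selection routine. The following Python sketch is our own simplification, not the scheduler of any actual implementation; choose_node, its arguments, and the node and rack names are hypothetical.

```python
def choose_node(block_holders, rack_of, free_nodes):
    """Pick a node for a task whose input block is stored on `block_holders`.

    block_holders: set of node names holding a replica of the block
    rack_of:       dict mapping node name -> rack name
    free_nodes:    list of nodes that currently have a free task slot
    """
    # 1. Best case: a free node already holds the data (move code to data).
    for node in free_nodes:
        if node in block_holders:
            return node, "node-local"
    # 2. Next best: a free node on the same rack as some replica, since
    #    intra-rack bandwidth is higher than inter-rack bandwidth.
    holder_racks = {rack_of[node] for node in block_holders}
    for node in free_nodes:
        if rack_of[node] in holder_racks:
            return node, "rack-local"
    # 3. Otherwise start the task elsewhere and stream the data to it.
    if free_nodes:
        return free_nodes[0], "off-rack"
    # 4. Nothing is free: the task stays in the queue.
    return None, "wait"


racks = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(choose_node({"n1"}, racks, ["n3", "n1"]))  # prints ('n1', 'node-local')
print(choose_node({"n1"}, racks, ["n3", "n2"]))  # prints ('n2', 'rack-local')
print(choose_node({"n1"}, racks, ["n3"]))        # prints ('n3', 'off-rack')
```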

¹¹ In the canonical case, that is. Recall that MapReduce may receive its input from other sources.

Synchronization. In general, synchronization refers to the mechanisms by which multiple concur-

rently running processes “join up”, for example, to share intermediate results or otherwise exchange

state information. In MapReduce, synchronization is accomplished by a barrier between the map

and reduce phases of processing. Intermediate key-value pairs must be grouped by key, which is

accomplished by a large distributed sort involving all the nodes that executed map tasks and all the

nodes that will execute reduce tasks. This necessarily involves copying intermediate data over the

network, and therefore the process is commonly known as “shufﬂe and sort”. A MapReduce job

with m mappers and r reducers involves up to m×r distinct copy operations, since each mapper

may have intermediate output going to every reducer.
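The m×r bound can be checked on a toy example. In the Python sketch below, each mapper's intermediate output is assigned to reducers and the distinct (mapper, reducer) transfers are counted; toy_partition is a deterministic stand-in we made up for illustration and is not the partitioning function of any real implementation.

```python
def toy_partition(key, num_reducers):
    """Hypothetical partitioner: sum of the key's bytes, modulo r."""
    return sum(key.encode("utf-8")) % num_reducers


def count_copy_operations(mapper_outputs, num_reducers, partition):
    """Count distinct (mapper, reducer) pairs that must exchange data."""
    transfers = set()
    for mapper_id, pairs in enumerate(mapper_outputs):
        for key, _value in pairs:
            transfers.add((mapper_id, partition(key, num_reducers)))
    return len(transfers)


# Two mappers (m = 2) and two reducers (r = 2): at most 2 x 2 = 4 copies.
outputs = [
    [("a", 1), ("b", 1)],  # mapper 0 has data for both reducers
    [("a", 1)],            # mapper 1 has data for one reducer only
]
print(count_copy_operations(outputs, 2, toy_partition))  # prints 3
```

The count reaches m×r only when every mapper produces at least one key for every reducer, which is the typical situation for large inputs.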

Note that the reduce computation cannot start until all the mappers have ﬁnished emitting

key-value pairs and all intermediate key-value pairs have been shufﬂed and sorted, since the ex-

ecution framework cannot otherwise guarantee that all values associated with the same key have

been gathered. This is an important departure from functional programming: in a fold operation,

the aggregation function g is a function of the intermediate value and the next item in the list—

which means that values can be lazily generated and aggregation can begin as soon as values are

available. In contrast, the reducer in MapReduce receives all values associated with the same key at

once. However, it is possible to start copying intermediate key-value pairs over the network to the

nodes running the reducers as soon as each mapper ﬁnishes—this is a common optimization and

implemented in Hadoop.

Error and fault handling. The MapReduce execution framework must accomplish all the tasks

above in an environment where errors and faults are the norm, not the exception. Since MapReduce

was explicitly designed around low-end commodity servers, the runtime must be especially resilient.

In large clusters, disk failures are common [123] and RAM experiences more errors than one might

expect [135]. Datacenters suffer from both planned outages (e.g., system maintenance and hardware

upgrades) and unexpected outages (e.g., power failure, connectivity loss, etc.).

And that’s just hardware. No software is bug free—exceptions must be appropriately trapped,

logged, and recovered from. Large-data problems have a penchant for uncovering obscure corner

cases in code that is otherwise thought to be bug-free. Furthermore, any sufﬁciently large dataset will

contain corrupted data or records that are mangled beyond a programmer’s imagination—resulting

in errors that one would never think to check for or trap. The MapReduce execution framework

must thrive in this hostile environment.

2.4 PARTITIONERS AND COMBINERS

We have thus far presented a simpliﬁed view of MapReduce. There are two additional elements that

complete the programming model: partitioners and combiners.

Partitioners are responsible for dividing up the intermediate key space and assigning inter-

mediate key-value pairs to reducers. In other words, the partitioner speciﬁes the task to which an

intermediate key-value pair must be copied. Within each reducer, keys are processed in sorted order

(which is how the “group by” is implemented). The simplest partitioner involves computing the hash

value of the key and then taking the mod of that value with the number of reducers. This assigns

approximately the same number of keys to each reducer (dependent on the quality of the hash func-

tion). Note, however, that the partitioner only considers the key and ignores the value—therefore, a

roughly even partitioning of the key space may nevertheless yield large differences in the number of

key-value pairs sent to each reducer (since different keys may have different numbers of associated

values). This imbalance in the amount of data associated with each key is relatively common in many

text processing applications due to the Zipﬁan distribution of word occurrences.
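A minimal sketch of the hash-and-mod partitioner follows, together with a helper that measures the resulting load. Both function names are hypothetical, and the use of `zlib.crc32` as the hash function is an assumption made so that results are repeatable across runs (Python's built-in `hash` of a string is salted per process).

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    # Hash of the key, then mod the number of reducers; the value is ignored.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def load_per_reducer(pairs, num_reducers):
    """Return (distinct keys per reducer, key-value pairs per reducer)."""
    keys = [set() for _ in range(num_reducers)]
    values = Counter()
    for key, _ in pairs:
        r = partition(key, num_reducers)
        keys[r].add(key)
        values[r] += 1
    return [len(k) for k in keys], [values[r] for r in range(num_reducers)]


# One very frequent word and a few rare ones, mimicking a skewed distribution.
pairs = [("the", 1)] * 1000 + [(w, 1) for w in ["zipf", "law", "rare", "word"]]
key_counts, pair_counts = load_per_reducer(pairs, num_reducers=4)
```

However the five keys happen to be spread, the reducer that receives `the` must process at least 1000 of the 1004 pairs, which is the imbalance described above.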

Combiners are an optimization in MapReduce that allow for local aggregation before the

shufﬂe and sort phase. We can motivate the need for combiners by considering the word count

algorithm in Figure 2.3, which emits a key-value pair for each word in the collection. Furthermore,

all these key-value pairs need to be copied across the network, and so the amount of intermediate data

will be larger than the input collection itself. This is clearly inefﬁcient. One solution is to perform

local aggregation on the output of each mapper, i.e., to compute a local count for a word over all the

documents processed by the mapper. With this modiﬁcation (assuming the maximum amount of

local aggregation possible), the number of intermediate key-value pairs will be at most the number

of unique words in the collection times the number of mappers (and typically far smaller because

each mapper may not encounter every word).
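The effect of local aggregation on the number of intermediate pairs can be checked with a small sketch. The two mapper functions below are hypothetical stand-ins written for this illustration; they assume whitespace tokenization and that each mapper's local counts fit in memory.

```python
from collections import Counter


def map_without_aggregation(docs):
    # One (word, 1) pair per token, as in the basic word count algorithm.
    return [(w, 1) for doc in docs for w in doc.split()]


def map_with_local_aggregation(docs):
    # One (word, local count) pair per distinct word seen by this mapper.
    counts = Counter(w for doc in docs for w in doc.split())
    return sorted(counts.items())


docs_per_mapper = [["a b a", "b a"], ["c a c"]]
plain = sum(len(map_without_aggregation(d)) for d in docs_per_mapper)
aggregated = sum(len(map_with_local_aggregation(d)) for d in docs_per_mapper)
vocabulary = {w for docs in docs_per_mapper for doc in docs for w in doc.split()}
upper_bound = len(vocabulary) * len(docs_per_mapper)
```

In this toy collection the mappers emit 8 pairs without aggregation and 4 with it, below the bound of 3 unique words times 2 mappers.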

The combiner in MapReduce supports such an optimization. One can think of combiners as

“mini-reducers” that take place on the output of the mappers, prior to the shufﬂe and sort phase.

Each combiner operates in isolation and therefore does not have access to intermediate output from

other mappers. The combiner is provided keys and values associated with each key (the same types

as the mapper output keys and values). Critically, one cannot assume that a combiner will have the

opportunity to process all values associated with the same key. The combiner can emit any number of

key-value pairs, but the keys and values must be of the same type as the mapper output (same as the

reducer input).

12

In cases where an operation is both associative and commutative (e.g., addition or

multiplication), reducers can directly serve as combiners. In general, however, reducers and combiners

are not interchangeable.
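A small sketch shows why a reducer cannot always double as a combiner. The choice of the arithmetic mean as the counterexample and all function names are ours; the sketch assumes the values for one key are split across two mappers.

```python
def mean(values):
    return sum(values) / len(values)


# Values for a single key, as seen by two different mappers.
per_mapper = [[1, 2, 3], [10]]
all_values = [v for values in per_mapper for v in values]

# Addition: summing the partial sums gives the same answer as summing everything.
sum_is_safe = sum(sum(values) for values in per_mapper) == sum(all_values)

# Mean: reusing the reducer as a combiner gives the mean of means, which is wrong.
true_mean = mean(all_values)
mean_of_means = mean([mean(values) for values in per_mapper])


def combine(values):
    # A correct combiner emits a partial (sum, count) instead of a partial mean.
    return (sum(values), len(values))


def reduce_mean(partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


fixed_mean = reduce_mean([combine(values) for values in per_mapper])
```

The true mean is 4.0, the mean of means is 6.0, and the (sum, count) formulation recovers 4.0 no matter how many times the combiner runs.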

In many cases, proper use of combiners can spell the difference between an impractical algorithm

and an efficient algorithm. This topic will be discussed in Section 3.1, which focuses on various

techniques for local aggregation. It sufﬁces to say for now that a combiner can signiﬁcantly reduce

the amount of data that needs to be copied over the network, resulting in much faster algorithms.

The complete MapReduce model is shown in Figure 2.4. Output of the mappers is processed

by the combiners, which perform local aggregation to cut down on the number of intermediate key-

value pairs. The partitioner determines which reducer will be responsible for processing a particular

12

A note on the implementation of combiners in Hadoop: by default, the execution framework reserves the right to use combiners at

its discretion. In reality, this means that a combiner may be invoked zero, one, or multiple times. In addition, combiners in Hadoop

may actually be invoked in the reduce phase, i.e., after key-value pairs have been copied over to the reducer, but before the user

reducer code runs. As a result, combiners must be carefully written so that they can be executed in these different environments.

Section 3.1.2 discusses this in more detail.

[Figure 2.4 diagram: input key-value pairs flow through four mappers, each followed by a combiner and a partitioner; the shuffle and sort stage aggregates values by key; three reducers produce the output.]

Figure 2.4: Complete view of MapReduce, illustrating combiners and partitioners in addition to mappers

and reducers. Combiners can be viewed as “mini-reducers” in the map phase. Partitioners determine which

reducer is responsible for a particular key.

key, and the execution framework uses this information to copy the data to the right location

during the shufﬂe and sort phase.

13

Therefore, a complete MapReduce job consists of code for the

mapper, reducer, combiner, and partitioner, along with job conﬁguration parameters. The execution

framework handles everything else.

2.5 THE DISTRIBUTED FILE SYSTEM

So far, we have mostly focused on the processing aspect of data-intensive processing, but it is impor-

tant to recognize that without data, there is nothing to compute on. In high-performance computing

(HPC) and many traditional cluster architectures, storage is viewed as a distinct and separate com-

ponent from computation. Implementations vary widely, but network-attached storage (NAS) and

storage area networks (SAN) are common; supercomputers often have dedicated subsystems for

handling storage (separate nodes, and often even separate networks). Regardless of the details, the

13

In Hadoop, partitioners are actually executed before combiners, so while Figure 2.4 is conceptually accurate, it doesn’t precisely

describe the Hadoop implementation.

processing cycle remains the same at a high level: the compute nodes fetch input from storage, load

the data into memory, process the data, and then write back the results (with perhaps intermediate

checkpointing for long-running processes).

As dataset sizes increase, more compute capacity is required for processing. But as compute

capacity grows, the link between the compute nodes and the storage becomes a bottleneck. At

that point, one could invest in higher performance but more expensive networks (e.g., 10 gigabit

Ethernet) or special-purpose interconnects such as InﬁniBand (even more expensive). In most cases,

this is not a cost-effective solution, as the price of networking equipment increases non-linearly

with performance (e.g., a switch with ten times the capacity is usually more than ten times more

expensive). Alternatively, one could abandon the separation of computation and storage as distinct

components in a cluster. The distributed file system (DFS) that underlies MapReduce adopts exactly

this approach. The Google File System (GFS) [57] supports Google’s proprietary implementation of

MapReduce; in the open-source world, HDFS (Hadoop Distributed File System) is an open-source

implementation of GFS that supports Hadoop. Although MapReduce doesn’t necessarily require

the distributed ﬁle system, it is difﬁcult to realize many of the advantages of the programming model

without a storage substrate that behaves much like the DFS.

14

Of course, distributed ﬁle systems are not new [7; 32; 74; 133; 147]. The MapReduce dis-

tributed ﬁle system builds on previous work but is speciﬁcally adapted to large-data processing

workloads, and therefore departs from previous architectures in certain respects (see discussion by

Ghemawat et al. [57] in the original GFS paper). The main idea is to divide user data into blocks

and replicate those blocks across the local disks of nodes in the cluster. Blocking data, of course, is

not a new idea, but DFS blocks are signiﬁcantly larger than block sizes in typical single-machine

ﬁle systems (64 MB by default). The distributed ﬁle system adopts a master–slave architecture in

which the master maintains the ﬁle namespace (metadata, directory structure, ﬁle to block mapping,

location of blocks, and access permissions) and the slaves manage the actual data blocks. In GFS,

the master is called the GFS master, and the slaves are called GFS chunkservers. In Hadoop, the

same roles are ﬁlled by the namenode and datanodes, respectively.

15

This book adopts the Hadoop

terminology, although for most basic ﬁle operations GFS and HDFS work much the same way. The

architecture of HDFS is shown in Figure 2.5, redrawn from a similar diagram describing GFS [57].

In HDFS, an application client wishing to read a ﬁle (or a portion thereof ) must ﬁrst contact

the namenode to determine where the actual data is stored. In response to the client request, the

namenode returns the relevant block id and the location where the block is held (i.e., which datanode).

The client then contacts the datanode to retrieve the data. Blocks are themselves stored on standard

single-machine file systems, so HDFS lies on top of the standard OS stack (e.g., Linux). An important

feature of the design is that data is never moved through the namenode. Instead, all data transfer

14

However, there is evidence that existing POSIX-based distributed cluster ﬁle systems (e.g., GPFS or PVFS) can serve as a replace-

ment for HDFS, when properly tuned or modiﬁed for MapReduce workloads [6; 146]. This, however, remains an experimental

use case.

15

To be precise, namenode and datanode may refer to physical machines in a cluster, or they may refer to daemons running on those

machines providing the relevant services.

[Figure 2.5 diagram: the application's HDFS client sends (file name, block id) to the HDFS namenode, which holds the file namespace (e.g., /foo/bar, block 3df2), and receives (block id, block location); the client then sends (block id, byte range) to an HDFS datanode, which sits on a Linux file system, and receives block data; the namenode sends instructions to the datanodes and receives datanode state.]

Figure 2.5: The architecture of HDFS. The namenode (master) is responsible for maintaining the file

namespace and directing clients to datanodes (slaves) that actually hold data blocks containing user data.

occurs directly between clients and datanodes; communication with the namenode only involves

transfer of metadata.
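The read path can be sketched with two toy classes. Everything here is a simplified, in-memory stand-in written for illustration: the class and method names are assumptions and do not correspond to the HDFS client API, and each block is assumed to have a single location.

```python
class DataNode:
    def __init__(self):
        self.blocks = {}  # block id -> bytes held on this node

    def read(self, block_id, start=0, end=None):
        # Serves block data for a (block id, byte range) request.
        return self.blocks[block_id][start:end]


class NameNode:
    def __init__(self):
        self.namespace = {}  # file name -> ordered list of block ids
        self.locations = {}  # block id -> name of a datanode holding it

    def lookup(self, filename):
        # Returns metadata only: (block id, block location) pairs.
        return [(b, self.locations[b]) for b in self.namespace[filename]]


def read_file(namenode, datanodes, filename):
    data = b""
    for block_id, node in namenode.lookup(filename):
        # Block contents come straight from the datanode, never via the namenode.
        data += datanodes[node].read(block_id)
    return data


namenode = NameNode()
datanodes = {"dn1": DataNode(), "dn2": DataNode()}
namenode.namespace["/foo/bar"] = ["3df2", "3df3"]
namenode.locations.update({"3df2": "dn1", "3df3": "dn2"})
datanodes["dn1"].blocks["3df2"] = b"hello "
datanodes["dn2"].blocks["3df3"] = b"world"
contents = read_file(namenode, datanodes, "/foo/bar")
```

Note that `NameNode` stores no file contents at all, which mirrors the design point that only metadata passes through the master.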

By default, HDFS stores three separate copies of each data block to ensure reliability, availabil-

ity, and performance. In large clusters, the three replicas are spread across different physical racks, so

HDFS is resilient towards two common failure scenarios: individual datanode crashes and failures in

networking equipment that bring an entire rack ofﬂine. Replicating blocks across physical machines

also increases opportunities to co-locate data and processing in the scheduling of MapReduce jobs,

since multiple copies yield more opportunities to exploit locality. The namenode is in periodic com-

munication with the datanodes to ensure proper replication of all the blocks: if there aren’t enough

replicas (e.g., due to disk or machine failures or to connectivity losses due to networking equipment

failures), the namenode directs the creation of additional copies;

16

if there are too many replicas

(e.g., a repaired node rejoins the cluster), extra copies are discarded.
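The replication check can be sketched as a pure function over the namenode's view of block locations. The function name, the data layout, and the action labels are assumptions for illustration; the real namenode also decides which datanodes should receive or drop copies, which is omitted here.

```python
def replication_actions(block_locations, live_nodes, target=3):
    """Decide, per block, whether copies must be created or discarded.

    block_locations maps a block id to the set of datanodes believed to hold it;
    live_nodes is the set of datanodes currently reachable.
    """
    actions = {}
    for block_id, nodes in block_locations.items():
        live = len([n for n in nodes if n in live_nodes])
        if live < target:
            actions[block_id] = ("replicate", target - live)
        elif live > target:
            actions[block_id] = ("discard", live - target)
    return actions


locations = {
    "b1": {"dn1", "dn2", "dn3"},
    "b2": {"dn1", "dn4", "dn5"},         # dn4 is down, so one replica is missing
    "b3": {"dn1", "dn2", "dn3", "dn5"},  # one replica too many
}
actions = replication_actions(locations, live_nodes={"dn1", "dn2", "dn3", "dn5"})
```

Block `b1` needs no action, `b2` needs one additional copy, and `b3` has one copy to discard.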

To create a new file and write data to HDFS, the application client first contacts the namenode,

which updates the ﬁle namespace after checking permissions and making sure the ﬁle doesn’t already

exist. The namenode allocates a new block on a suitable datanode, and the application is directed to

stream data directly to it. From the initial datanode, data is further propagated to additional replicas.

In the most recent release of Hadoop as of this writing (release 0.20.2), ﬁles are immutable—they

cannot be modiﬁed after creation. There are current plans to ofﬁcially support ﬁle appends in the

near future, which is a feature already present in GFS.

In summary, the HDFS namenode has the following responsibilities:

16

Note that the namenode coordinates the replication process, but data transfer occurs directly from datanode to datanode.

• Namespace management. The namenode is responsible for maintaining the ﬁle namespace,

which includes metadata, directory structure, ﬁle to block mapping, location of blocks, and

access permissions. These data are held in memory for fast access and all mutations are per-

sistently logged.

• Coordinating ﬁle operations. The namenode directs application clients to datanodes for read

operations, and allocates blocks on suitable datanodes for write operations. All data trans-

fers occur directly between clients and datanodes. When a ﬁle is deleted, HDFS does not

immediately reclaim the available physical storage; rather, blocks are lazily garbage collected.

• Maintaining overall health of the ﬁle system. The namenode is in periodic contact with the

datanodes via heartbeat messages to ensure the integrity of the system. If the namenode

observes that a data block is under-replicated (fewer copies are stored on datanodes than the

desired replication factor), it will direct the creation of new replicas. Finally, the namenode

is also responsible for rebalancing the ﬁle system.

17

During the course of normal operations,

certain datanodes may end up holding more blocks than others; rebalancing involves moving

blocks from datanodes with more blocks to datanodes with fewer blocks. This leads to better

load balancing and more even disk utilization.
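Rebalancing, as described in the last item above, can be sketched as a greedy loop that moves one block at a time from the fullest datanode to the emptiest. This is a toy model under the assumption that all blocks are the same size and freely movable; it is not the algorithm used by any particular Hadoop tool, and the function name is hypothetical.

```python
def rebalance(block_counts):
    """Greedily move blocks until no two datanodes differ by more than one block."""
    counts = dict(block_counts)
    moves = []
    while True:
        fullest = max(counts, key=counts.get)
        emptiest = min(counts, key=counts.get)
        if counts[fullest] - counts[emptiest] <= 1:
            break
        counts[fullest] -= 1
        counts[emptiest] += 1
        moves.append((fullest, emptiest))
    return counts, moves


balanced, moves = rebalance({"dn1": 6, "dn2": 1, "dn3": 2})
```

Starting from 6, 1, and 2 blocks, three moves leave every datanode with 3 blocks.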

Since GFS and HDFS were speciﬁcally designed to support Google’s proprietary and the open-

source implementation of MapReduce, respectively, they were designed with a number of assump-

tions about the operational environment, which in turn inﬂuenced the design of the systems. Un-

derstanding these choices is critical to designing effective MapReduce algorithms:

• The ﬁle system stores a relatively modest number of large ﬁles. The deﬁnition of “modest”

varies by the size of the deployment, but in HDFS multi-gigabyte ﬁles are common (and

even encouraged). There are several reasons why lots of small ﬁles are to be avoided. Since the

namenode must hold all ﬁle metadata in memory, this presents an upper bound on both the

number of ﬁles and blocks that can be supported.

18

Large multi-block ﬁles represent a more

efﬁcient use of namenode memory than many single-block ﬁles (each of which consumes less

space than a single block size). In addition, mappers in a MapReduce job use individual ﬁles as

a basic unit for splitting input data. At present, there is no default mechanism in Hadoop that

allows a mapper to process multiple ﬁles. As a result, mapping over many small ﬁles will yield

as many map tasks as there are ﬁles. This results in two potential problems: ﬁrst, the startup

costs of mappers may become signiﬁcant compared to the time spent actually processing input

key-value pairs; second, this may result in an excessive number of across-the-network copy

operations during the “shufﬂe and sort” phase (recall that a MapReduce job with m mappers

and r reducers involves up to m×r distinct copy operations).

17

In Hadoop, this is a manually invoked process.

18

According to Dhruba Borthakur in a post to the Hadoop mailing list on 6/8/2008, each block in HDFS occupies about 150 bytes

of memory on the namenode.

• Workloads are batch oriented, dominated by long streaming reads and large sequential writes.

As a result, high sustained bandwidth is more important than low latency. This exactly describes

the nature of MapReduce jobs, which are batch operations on large amounts of data. Due to the

common-case workload, neither HDFS nor GFS implements any form of data caching.

19

• Applications are aware of the characteristics of the distributed ﬁle system. Neither HDFS nor

GFS presents a general POSIX-compliant API; rather, each supports only a subset of possible

ﬁle operations. This simpliﬁes the design of the distributed ﬁle system, and in essence pushes

part of the data management onto the end application. One rationale for this decision is that

each application knows best how to handle data speciﬁc to that application, for example, in

terms of resolving inconsistent states and optimizing the layout of data structures.

• The ﬁle system is deployed in an environment of cooperative users. There is no discussion of

security in the original GFS paper, but HDFS explicitly assumes a datacenter environment

where only authorized users have access. File permissions in HDFS are only meant to prevent

unintended operations and can be easily circumvented.

20

• The system is built from unreliable but inexpensive commodity components. As a result, failures

are the norm rather than the exception. HDFS is designed around a number of self-monitoring

and self-healing mechanisms to robustly cope with common failure modes.
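The cost of many small files, raised in the first item of the list above, can be estimated with back-of-the-envelope arithmetic. The sketch below uses the roughly 150 bytes per block quoted in footnote 18 and the 64 MB default block size; it counts only per-block entries and ignores per-file metadata, which is a simplifying assumption, and the function name is hypothetical.

```python
import math

BYTES_PER_BLOCK_ENTRY = 150   # approximate namenode memory per block (footnote 18)
BLOCK_SIZE = 64 * 2**20       # 64 MB default block size


def namenode_block_memory(file_sizes):
    """Return (number of blocks, approximate namenode bytes) for the given files."""
    blocks = sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)
    return blocks, blocks * BYTES_PER_BLOCK_ENTRY


# The same terabyte stored as one large file or as a million 1 MB files.
large_blocks, large_bytes = namenode_block_memory([2**40])
small_blocks, small_bytes = namenode_block_memory(2**20 for _ in range(2**20))
```

Under these assumptions the single large file occupies 16,384 blocks (about 2.3 MB of namenode memory), while the small files occupy 1,048,576 blocks (about 150 MB), a 64-fold difference. A job over the small files would also be split into at least as many map tasks as there are files.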

Finally, some discussion is necessary to understand the single-master design of HDFS and GFS. It

has been demonstrated that in large-scale distributed systems, simultaneously providing consistency,

availability, and partition tolerance is impossible—this is Brewer’s so-called CAP Theorem [58].

Since partitioning is unavoidable in large-data systems, the real tradeoff is between consistency and

availability. A single-master design trades availability for consistency and signiﬁcantly simpliﬁes

implementation. If the master (HDFS namenode or GFS master) goes down, the entire ﬁle system

becomes unavailable, which trivially guarantees that the ﬁle system will never be in an inconsistent

state. An alternative design might involve multiple masters that jointly manage the ﬁle namespace—

such an architecture would increase availability (if one goes down, another can step in) at the cost

of consistency, not to mention requiring a more complex implementation (cf. [4; 105]).

The single-master design of GFS and HDFS is a well-known weakness, since if the master

goes ofﬂine, the entire ﬁle system and all MapReduce jobs running on top of it will grind to a halt.

This weakness is mitigated in part by the lightweight nature of ﬁle system operations. Recall that no

data is ever moved through the namenode and that all communication between clients and the namenode

involves only metadata. Because of this, the namenode is rarely the bottleneck, and for the most part

avoids load-induced crashes. In practice, this single point of failure is not as severe a limitation as

it may appear: with diligent monitoring of the namenode, mean times between failures measured in

months are not uncommon for production deployments. Furthermore, the Hadoop community is

19

However, since the distributed ﬁle system is built on top of a standard operating system such as Linux, there is still OS-level

caching.

20

However, there are existing plans to integrate Kerberos into Hadoop/HDFS.

[Figure 2.6 diagram: a namenode machine running the namenode daemon, a job submission node running the jobtracker, and three slave nodes, each running a tasktracker and a datanode daemon on top of the Linux file system.]

Figure 2.6: Architecture of a complete Hadoop cluster, which consists of three separate components:

the HDFS master (called the namenode), the job submission node (called the jobtracker), and many slave

nodes (three shown here). Each of the slave nodes runs a tasktracker for executing map and reduce tasks

and a datanode daemon for serving HDFS data.

well aware of this problem and has developed several reasonable workarounds—for example, a warm

standby namenode that can be quickly switched over when the primary namenode fails. The open

source environment and the fact that many organizations already depend on Hadoop for production

systems virtually guarantee that more effective solutions will be developed over time.

2.6 HADOOP CLUSTER ARCHITECTURE

Putting everything together, the architecture of a complete Hadoop cluster is shown in Figure 2.6.

The HDFS namenode runs the namenode daemon. The job submission node runs the jobtracker,

which is the single point of contact for a client wishing to execute a MapReduce job. The jobtracker

monitors the progress of running MapReduce jobs and is responsible for coordinating the execution

of the mappers and reducers. Typically, these services run on two separate machines, although in

smaller clusters they are often co-located. The bulk of a Hadoop cluster consists of slave nodes (only

three of which are shown in the ﬁgure) that run both a tasktracker, which is responsible for actually

running user code, and a datanode daemon, for serving HDFS data.

A Hadoop MapReduce job is divided up into a number of map tasks and reduce tasks.

Tasktrackers periodically send heartbeat messages to the jobtracker that also double as a vehicle for

task allocation. If a tasktracker is available to run tasks (in Hadoop parlance, has empty task slots),

the return acknowledgment of the tasktracker heartbeat contains task allocation information. The

number of reduce tasks is equal to the number of reducers speciﬁed by the programmer. The number

of map tasks, on the other hand, depends on many factors: the number of mappers speciﬁed by the

programmer serves as a hint to the execution framework, but the actual number of tasks depends

on both the number of input ﬁles and the number of HDFS data blocks occupied by those ﬁles.

Each map task is assigned a sequence of input key-value pairs, called an input split in Hadoop. Input

splits are computed automatically and the execution framework strives to align them to HDFS block

boundaries so that each map task is associated with a single data block. In scheduling map tasks,

the jobtracker tries to take advantage of data locality—if possible, map tasks are scheduled on the

slave node that holds the input split, so that the mapper will be processing local data. The alignment

of input splits with HDFS block boundaries simpliﬁes task scheduling. If it is not possible to run

a map task on local data, it becomes necessary to stream input key-value pairs across the network.

Since large clusters are organized into racks, with far greater intra-rack bandwidth than inter-rack

bandwidth, the execution framework strives to at least place map tasks on a rack which has a copy

of the data block.
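The locality preference just described (same node first, then same rack, then anywhere) can be sketched as follows. The function name, its arguments, and the labels it returns are hypothetical; the actual jobtracker makes this decision in response to tasktracker heartbeats and weighs other factors as well.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Choose a node for a map task, preferring data locality.

    replica_nodes: nodes holding a copy of the task's input block.
    free_nodes: nodes with an empty task slot, in arrival order.
    rack_of: mapping from node name to rack name.
    """
    for node in free_nodes:
        if node in replica_nodes:
            return node, "node-local"
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    if free_nodes:
        return free_nodes[0], "remote"  # input must be streamed across racks
    return None, "wait"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r3"}
replicas = {"n1", "n3"}
```

For example, with replicas on `n1` and `n3`, a free slot on `n2` yields a rack-local assignment, while a free slot only on `n4` forces a remote read.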

Although conceptually in MapReduce one can think of the mapper being applied to all input

key-value pairs and the reducer being applied to all values associated with the same key, actual job

execution is a bit more complex. In Hadoop, mappers are Java objects with a Map method (among

others). A mapper object is instantiated for every map task by the tasktracker. The life-cycle of this

object begins with instantiation, where a hook is provided in the API to run programmer-speciﬁed

code. This means that mappers can read in “side data”, providing an opportunity to load state,

static data sources, dictionaries, etc. After initialization, the Map method is called (by the execution

framework) on all key-value pairs in the input split. Since these method calls occur in the context of

the same Java object, it is possible to preserve state across multiple input key-value pairs within the

same map task—this is an important property to exploit in the design of MapReduce algorithms, as

we will see in the next chapter. After all key-value pairs in the input split have been processed, the

mapper object provides an opportunity to run programmer-speciﬁed termination code. This, too,

will be important in the design of MapReduce algorithms.
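The mapper life-cycle can be mimicked in a few lines of Python. This is a sketch of the control flow only: the method names `setup`, `map`, and `cleanup`, the `emit` callback, and the `run_map_task` helper are our assumptions for illustration, not the Hadoop Java API.

```python
class WordCountMapper:
    def setup(self):
        # Initialization hook: load side data and create per-task state.
        self.stopwords = {"the", "a"}
        self.counts = {}

    def map(self, key, value, emit):
        # Called once per input pair; state persists between calls.
        for word in value.split():
            if word not in self.stopwords:
                self.counts[word] = self.counts.get(word, 0) + 1

    def cleanup(self, emit):
        # Termination hook: emit what was accumulated over the whole split.
        for word, count in sorted(self.counts.items()):
            emit(word, count)


def run_map_task(mapper, input_split):
    """Play the role of the execution framework for a single map task."""
    output = []

    def emit(key, value):
        output.append((key, value))

    mapper.setup()
    for key, value in input_split:
        mapper.map(key, value, emit)
    mapper.cleanup(emit)
    return output


pairs = run_map_task(WordCountMapper(), [(1, "the cat sat"), (2, "a cat")])
```

Because the same object handles every pair in the split, the mapper can defer all output until the termination hook, emitting `cat` once with a count of 2 rather than twice with a count of 1.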

The actual execution of reducers is similar to that of the mappers. A reducer object is

instantiated for every reduce task. The Hadoop API provides hooks for programmer-speciﬁed ini-

tialization and termination code. After initialization, for each intermediate key in the partition

(deﬁned by the partitioner), the execution framework repeatedly calls the Reduce method with an

intermediate key and an iterator over all values associated with that key. The programming model

also guarantees that intermediate keys will be presented to the Reduce method in sorted order.

Since this occurs in the context of a single object, it is possible to preserve state across multiple

intermediate keys (and associated values) within a single reduce task. Once again, this property is

critical in the design of MapReduce algorithms and will be discussed in the next chapter.
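The reducer life-cycle admits a similar sketch. As before, the method names, the `emit` callback, and the `run_reduce_task` helper are hypothetical, and the running total is just one simple use of state carried across keys.

```python
class CumulativeSumReducer:
    def setup(self):
        self.running = 0  # state preserved across intermediate keys

    def reduce(self, key, values, emit):
        # values is an iterator over all values associated with this key.
        self.running += sum(values)
        emit(key, self.running)

    def cleanup(self, emit):
        emit("__total__", self.running)


def run_reduce_task(reducer, partition):
    """Play the role of the framework: present keys to the reducer in sorted order."""
    output = []

    def emit(key, value):
        output.append((key, value))

    reducer.setup()
    for key in sorted(partition):
        reducer.reduce(key, iter(partition[key]), emit)
    reducer.cleanup(emit)
    return output


results = run_reduce_task(CumulativeSumReducer(),
                          {"b": [2, 7], "a": [1, 5], "c": [2, 9, 8]})
```

The cumulative sums are only meaningful because keys arrive in sorted order and the reducer object survives from one key to the next.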

2.7 SUMMARY

This chapter provides a basic overview of the MapReduce programming model, starting with its roots

in functional programming and continuing with a description of mappers, reducers, partitioners, and

combiners. Signiﬁcant attention is also given to the underlying distributed ﬁle system, which is a

tightly-integrated component of the MapReduce environment. Given this basic understanding, we

now turn our attention to the design of MapReduce algorithms.

CHAPTER 3

MapReduce Algorithm Design

A large part of the power of MapReduce comes from its simplicity: in addition to preparing the

input data, the programmer needs only to implement the mapper, the reducer, and optionally, the

combiner and the partitioner. All other aspects of execution are handled transparently by the exe-

cution framework—on clusters ranging from a single node to a few thousand nodes, over datasets

ranging from gigabytes to petabytes. However, this also means that any conceivable algorithm that

a programmer wishes to develop must be expressed in terms of a small number of rigidly deﬁned

components that must ﬁt together in very speciﬁc ways. It may not appear obvious how a multitude

of algorithms can be recast into this programming model. The purpose of this chapter is to provide,

primarily through examples, a guide to MapReduce algorithm design. These examples illustrate

what can be thought of as “design patterns” for MapReduce, which instantiate arrangements of

components and speciﬁc techniques designed to handle frequently encountered situations across a

variety of problem domains. Two of these design patterns are used in the scalable inverted indexing

algorithm we’ll present later in Chapter 4; concepts presented here will show up again in Chapter 5

(graph processing) and Chapter 6 (expectation-maximization algorithms).

Synchronization is perhaps the trickiest aspect of designing MapReduce algorithms (or

for that matter, parallel and distributed algorithms in general). Other than embarrassingly-parallel

problems, processes running on separate nodes in a cluster must, at some point in time, come

together—for example, to distribute partial results from nodes that produced them to the nodes that

will consume them. Within a single MapReduce job, there is only one opportunity for cluster-wide

synchronization—during the shufﬂe and sort stage where intermediate key-value pairs are copied

from the mappers to the reducers and grouped by key. Beyond that, mappers and reducers run in

isolation without any mechanisms for direct communication. Furthermore, the programmer has little

control over many aspects of execution, for example:

• Where a mapper or reducer runs (i.e., on which node in the cluster).

• When a mapper or reducer begins or ﬁnishes.

• Which input key-value pairs are processed by a speciﬁc mapper.

• Which intermediate key-value pairs are processed by a speciﬁc reducer.

Nevertheless, the programmer does have a number of techniques for controlling execution and

managing the ﬂow of data in MapReduce. In summary, they are:

1. The ability to construct complex data structures as keys and values to store and communicate

partial results.

2. The ability to execute user-speciﬁed initialization code at the beginning of a map or reduce

task, and the ability to execute user-speciﬁed termination code at the end of a map or reduce

task.

3. The ability to preserve state in both mappers and reducers across multiple input or intermediate

keys.

4. The ability to control the sort order of intermediate keys, and therefore the order in which a

reducer will encounter particular keys.

5. The ability to control the partitioning of the key space, and therefore the set of keys that will

be encountered by a particular reducer.

It is important to realize that many algorithms cannot be easily expressed as a single MapReduce

job. One must often decompose complex algorithms into a sequence of jobs, which requires or-

chestrating data so that the output of one job becomes the input to the next. Many algorithms are

iterative in nature, requiring repeated execution until some convergence criterion is met; graph algorithms

in Chapter 5 and expectation-maximization algorithms in Chapter 6 behave in exactly this way.

Often, the convergence check itself cannot be easily expressed in MapReduce. The standard solu-

tion is an external (non-MapReduce) program that serves as a “driver” to coordinate MapReduce

iterations.
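The driver pattern can be sketched as an ordinary loop. In the sketch, `run_iteration` stands in for submitting one MapReduce job and reading back its output; here it is replaced by a plain Python function (a Newton step for the square root of 2) purely so that the example is runnable. All names are hypothetical.

```python
def drive(run_iteration, state, converged, max_iterations=50):
    """Repeat an iteration until the convergence check passes or a limit is hit."""
    for i in range(1, max_iterations + 1):
        new_state = run_iteration(state)  # in practice: run one MapReduce job
        if converged(state, new_state):   # the check runs outside MapReduce
            return new_state, i
        state = new_state
    return state, max_iterations


# Stand-in for a job: one Newton step toward the square root of 2.
result, iterations = drive(
    run_iteration=lambda x: (x + 2 / x) / 2,
    state=1.0,
    converged=lambda old, new: abs(old - new) < 1e-9,
)
```

The cap on iterations is a defensive choice of ours: a driver that waits only on the convergence test may never terminate if the computation oscillates.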

This chapter explains how various techniques to control code execution and data ﬂow can

be applied to design algorithms in MapReduce. The focus is both on scalability—ensuring that

there are no inherent bottlenecks as algorithms are applied to increasingly larger datasets—and

efﬁciency—ensuring that algorithms do not needlessly consume resources and thereby reducing the

cost of parallelization. The gold standard, of course, is linear scalability: an algorithm running on

twice the amount of data should take only twice as long. Similarly, an algorithm running on twice

the number of nodes should only take half as long.

The chapter is organized as follows:

• Section 3.1 introduces the important concept of local aggregation in MapReduce and strategies

for designing efﬁcient algorithms that minimize the amount of partial results that need to be

copied across the network. The proper use of combiners is discussed in detail, as well as the

“in-mapper combining” design pattern.

• Section 3.2 uses the example of building word co-occurrence matrices on large text corpora

to illustrate two common design patterns, which we dub “pairs” and “stripes”. These two

approaches are useful in a large class of problems that require keeping track of joint events

across a large number of observations.

• Section 3.3 shows how co-occurrence counts can be converted into relative frequencies using a pattern known as “order inversion”. The sequencing of computations in the reducer can be recast as a sorting problem, where pieces of intermediate data are sorted into exactly the order that is required to carry out a series of computations. Often, a reducer needs to compute an aggregate statistic on a set of elements before individual elements can be processed. Normally, this would require two passes over the data, but with the “order inversion” design pattern, the aggregate statistic can be computed in the reducer before the individual elements are encountered. This may seem counter-intuitive: how can we compute an aggregate statistic on a set of elements before encountering elements of that set? As it turns out, clever sorting of special key-value pairs enables exactly this.

• Section 3.4 provides a general solution to secondary sorting, which is the problem of sorting values associated with a key in the reduce phase. We call this technique “value-to-key conversion”.

• Section 3.5 covers the topic of performing joins on relational datasets and presents three different approaches: reduce-side, map-side, and memory-backed joins.

3.1 LOCAL AGGREGATION

In the context of data-intensive distributed processing, the single most important aspect of synchronization is the exchange of intermediate results, from the processes that produced them to the processes that will ultimately consume them. In a cluster environment, with the exception of embarrassingly-parallel problems, this necessarily involves transferring data over the network. Furthermore, in Hadoop, intermediate results are written to local disk before being sent over the network. Since network and disk latencies are relatively expensive compared to other operations, reductions in the amount of intermediate data translate into increases in algorithmic efficiency. In MapReduce, local aggregation of intermediate results is one of the keys to efficient algorithms. Through use of the combiner and by taking advantage of the ability to preserve state across multiple inputs, it is often possible to substantially reduce both the number and size of key-value pairs that need to be shuffled from the mappers to the reducers.

3.1.1 COMBINERS AND IN-MAPPER COMBINING

We illustrate various techniques for local aggregation using the simple word count example presented in Section 2.2. For convenience, Figure 3.1 repeats the pseudo-code of the basic algorithm, which is quite simple: the mapper emits an intermediate key-value pair for each term observed, with the term itself as the key and a value of one; reducers sum up the partial counts to arrive at the final count.

    class Mapper
        method Map(docid a, doc d)
            for all term t ∈ doc d do
                Emit(term t, count 1)

    class Reducer
        method Reduce(term t, counts [c_1, c_2, ...])
            sum ← 0
            for all count c ∈ counts [c_1, c_2, ...] do
                sum ← sum + c
            Emit(term t, count sum)

Figure 3.1: Pseudo-code for the basic word count algorithm in MapReduce (repeated from Figure 2.3).

The first technique for local aggregation is the combiner, already discussed in Section 2.4. Combiners provide a general mechanism within the MapReduce framework to reduce the amount of intermediate data generated by the mappers; recall that they can be understood as “mini-reducers” that process the output of mappers. In this example, the combiners aggregate term counts across the documents processed by each map task. This results in a reduction in the number of intermediate key-value pairs that need to be shuffled across the network, from the order of the total number of terms in the collection to the order of the number of unique terms in the collection.¹
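To make the effect of a combiner concrete, the following is a minimal single-process sketch in Python rather than Hadoop code. The helpers `group_by_key` and `run_map_task` are hypothetical stand-ins for work the execution framework performs, and the two-document split is invented for illustration.

```python
from collections import defaultdict

def map_word_count(doc_id, text):
    """Mapper of Figure 3.1: emit (term, 1) for every term occurrence."""
    for term in text.split():
        yield term, 1

def sum_counts(term, counts):
    """Reducer of Figure 3.1. Addition is commutative and associative,
    so the same function can double as the combiner."""
    yield term, sum(counts)

def group_by_key(pairs):
    """Hypothetical helper: group values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def run_map_task(split, combiner=None):
    """Apply the mapper to every document of one input split and, if a
    combiner is given, aggregate the map output locally."""
    emitted = [kv for doc_id, text in split
               for kv in map_word_count(doc_id, text)]
    if combiner is None:
        return emitted
    return [kv for key, values in group_by_key(emitted).items()
            for kv in combiner(key, values)]

split = [(1, "a dog is a dog"), (2, "a cat is a cat")]
without_combiner = run_map_task(split)
with_combiner = run_map_task(split, combiner=sum_counts)
print(len(without_combiner), len(with_combiner))  # 10 pairs versus 4 pairs
```

Without the combiner the map task emits one pair per token; with it, one pair per unique term seen by that task.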

An improvement on the basic algorithm is shown in Figure 3.2 (the mapper is modified but the reducer remains the same as in Figure 3.1 and therefore is not repeated). An associative array (i.e., Map in Java) is introduced inside the mapper to tally up term counts within a single document: instead of emitting a key-value pair for each term in the document, this version emits a key-value pair for each unique term in the document. Given that some words appear frequently within a document (for example, a document about dogs is likely to have many occurrences of the word “dog”), this can yield substantial savings in the number of intermediate key-value pairs emitted, especially for long documents.
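The per-document tally of Figure 3.2 might be sketched in Python as follows. This is an illustration only: `collections.Counter` plays the role of the associative array, and whitespace tokenization is an assumption.

```python
from collections import Counter

def map_per_document(doc_id, text):
    """Tally term counts for one document, then emit one
    (term, count) pair per unique term in that document."""
    tally = Counter(text.split())
    for term, count in tally.items():
        yield term, count

pairs = list(map_per_document(1, "dog bites dog and dog barks"))
print(pairs)
```

The six tokens of the example document yield four intermediate pairs, since “dog” is emitted once with a count of three.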

This basic idea can be taken one step further, as illustrated in the variant of the word count algorithm in Figure 3.3 (once again, only the mapper is modified). The workings of this algorithm critically depend on the details of how map and reduce tasks in Hadoop are executed, discussed in Section 2.6. Recall that a (Java) mapper object is created for each map task, which is responsible for processing a block of input key-value pairs. Prior to processing any input key-value pairs, the mapper’s Initialize method is called, which is an API hook for user-specified code. In this case, we initialize an associative array for holding term counts. Since it is possible to preserve state across multiple calls of the Map method (for each input key-value pair), we can continue to accumulate partial term counts in the associative array across multiple documents, and emit key-value pairs only when the mapper has processed all documents. That is, emission of intermediate data is deferred until the Close method in the pseudo-code. Recall that this API hook provides an opportunity to execute user-specified code after the Map method has been applied to all input key-value pairs of the input data split to which the map task was assigned.

¹ More precisely, if the combiners take advantage of all opportunities for local aggregation, the algorithm would generate at most $m \times V$ intermediate key-value pairs, where $m$ is the number of mappers and $V$ is the vocabulary size (number of unique terms in the collection), since every term could have been observed in every mapper. However, there are two additional factors to consider. Due to the Zipfian nature of term distributions, most terms will not be observed by most mappers (for example, terms that occur only once will by definition only be observed by one mapper). On the other hand, combiners in Hadoop are treated as optional optimizations, so there is no guarantee that the execution framework will take advantage of all opportunities for partial aggregation.

    class Mapper
        method Map(docid a, doc d)
            H ← new AssociativeArray
            for all term t ∈ doc d do
                H{t} ← H{t} + 1    ▷ Tally counts for entire document
            for all term t ∈ H do
                Emit(term t, count H{t})

Figure 3.2: Pseudo-code for the improved MapReduce word count algorithm that uses an associative array to aggregate term counts on a per-document basis. Reducer is the same as in Figure 3.1.

    class Mapper
        method Initialize
            H ← new AssociativeArray
        method Map(docid a, doc d)
            for all term t ∈ doc d do
                H{t} ← H{t} + 1    ▷ Tally counts across documents
        method Close
            for all term t ∈ H do
                Emit(term t, count H{t})

Figure 3.3: Pseudo-code for the improved MapReduce word count algorithm that demonstrates the “in-mapper combining” design pattern. Reducer is the same as in Figure 3.1.
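A rough Python rendering of the Figure 3.3 mapper is given below. The method names mirror the Initialize, Map, and Close hooks of the pseudo-code; the `run_map_task` driver is a hypothetical stand-in for the execution framework, not a Hadoop API.

```python
from collections import Counter

class InMapperCombiningWordCount:
    def initialize(self):
        # State that survives across calls to map() within one map task.
        self.counts = Counter()

    def map(self, doc_id, text):
        # Accumulate only; nothing is emitted per document.
        self.counts.update(text.split())

    def close(self):
        # Emit one pair per unique term seen in the whole input split.
        yield from self.counts.items()

def run_map_task(mapper, split):
    """Hypothetical driver: call the hooks in the order the text describes."""
    mapper.initialize()
    for doc_id, text in split:
        mapper.map(doc_id, text)
    return list(mapper.close())

split = [(1, "a dog is a dog"), (2, "a cat is a cat")]
emitted = run_map_task(InMapperCombiningWordCount(), split)
print(sorted(emitted))
```

All emission happens in `close`, so the number of intermediate pairs equals the number of unique terms in the split.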

With this technique, we are in essence incorporating combiner functionality directly inside the mapper. There is no need to run a separate combiner, since all opportunities for local aggregation are already exploited.² This is a sufficiently common design pattern in MapReduce that it’s worth giving it a name, “in-mapper combining”, so that we can refer to the pattern more conveniently throughout the book. We’ll see later on how this pattern can be applied to a variety of problems.

There are two main advantages to using this design pattern:

First, it provides control over when local aggregation occurs and exactly how it takes place. In contrast, the semantics of the combiner is underspecified in MapReduce. For example, Hadoop makes no guarantees on how many times the combiner is applied, or that it is even applied at all. The combiner is provided as a semantics-preserving optimization to the execution framework, which has the option of using it, perhaps multiple times, or not at all (or even in the reduce phase). In some cases (although not in this particular example), such indeterminism is unacceptable, which is exactly why programmers often choose to perform their own local aggregation in the mappers.

² Leaving aside the minor complication that in Hadoop, combiners can be run in the reduce phase also (when merging intermediate key-value pairs from different map tasks). However, in practice, it makes almost no difference either way.

Second, in-mapper combining will typically be more efficient than using actual combiners. One reason for this is the additional overhead associated with actually materializing the key-value pairs. Combiners reduce the amount of intermediate data that is shuffled across the network, but don’t actually reduce the number of key-value pairs that are emitted by the mappers in the first place. With the algorithm in Figure 3.2, intermediate key-value pairs are still generated on a per-document basis, only to be “compacted” by the combiners. This process involves unnecessary object creation and destruction (garbage collection takes time), and furthermore, object serialization and deserialization (when intermediate key-value pairs fill the in-memory buffer holding map outputs and need to be temporarily spilled to disk). In contrast, with in-mapper combining, the mappers will generate only those key-value pairs that need to be shuffled across the network to the reducers.

There are, however, drawbacks to the in-mapper combining pattern. First, it breaks the functional programming underpinnings of MapReduce, since state is being preserved across multiple input key-value pairs. Ultimately, this isn’t a big deal, since pragmatic concerns for efficiency often trump theoretical “purity”, but there are practical consequences as well. Preserving state across multiple input instances means that algorithmic behavior may depend on the order in which input key-value pairs are encountered. This creates the potential for ordering-dependent bugs, which are difficult to debug on large datasets in the general case (although the correctness of in-mapper combining for word count is easy to demonstrate). Second, there is a fundamental scalability bottleneck associated with the in-mapper combining pattern. It critically depends on having sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in an input split. In the word count example, the memory footprint is bounded by the vocabulary size, since it is theoretically possible that a mapper encounters every term in the collection. Heaps’ Law, a well-known result in information retrieval, accurately models the growth of vocabulary size as a function of the collection size; the somewhat surprising fact is that the vocabulary size never stops growing.³ Therefore, the algorithm in Figure 3.3 will scale only up to a point, beyond which the associative array holding the partial term counts will no longer fit in memory.⁴
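As a small worked example of why the associative array keeps growing, the snippet below evaluates the Heaps’ Law form V = kT^b given in the accompanying footnote. The particular values k = 50 and b = 0.5 are assumptions picked from within the typical ranges quoted there, not measurements of any real collection.

```python
import math

def heaps_vocabulary_size(num_tokens, k=50.0, b=0.5):
    """Estimate vocabulary size V = k * T**b.
    k and b are illustrative assumptions, not fitted parameters."""
    return k * num_tokens ** b

for tokens in (10**6, 10**8, 10**10):
    print(tokens, round(heaps_vocabulary_size(tokens)))

# With b = 0.5, quadrupling the collection doubles the estimated vocabulary.
growth = heaps_vocabulary_size(4 * 10**8) / heaps_vocabulary_size(10**8)
print(growth)
```

Under these assumed parameters the estimate grows without bound, only more slowly than the collection itself, which is the behavior the paragraph describes.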

³ In more detail, Heaps’ Law relates the vocabulary size $V$ to the collection size as follows: $V = kT^b$, where $T$ is the number of tokens in the collection. Typical values of the parameters $k$ and $b$ are: $30 \le k \le 100$ and $b \sim 0.5$ ([101], p. 81).

⁴ A few more details: note that what matters is that the partial term counts encountered within a particular input split fit into memory. However, as collection sizes increase, one will often want to increase the input split size to limit the growth of the number of map tasks (in order to reduce the number of distinct copy operations necessary to shuffle intermediate data over the network).

One common solution to limiting memory usage when using the in-mapper combining technique is to “block” input key-value pairs and “flush” in-memory data structures periodically. The idea is simple: instead of emitting intermediate data only after every key-value pair has been processed, emit partial results after processing every n key-value pairs. This is straightforwardly implemented with a counter variable that keeps track of the number of input key-value pairs that have been processed. As an alternative, the mapper could keep track of its own memory footprint and flush intermediate key-value pairs once memory usage has crossed a certain threshold. In both approaches, either the block size or the memory usage threshold needs to be determined empirically: with too large a value, the mapper may run out of memory, but with too small a value, opportunities for local aggregation may be lost. Furthermore, in Hadoop physical memory is split between multiple tasks that may be running on a node concurrently; these tasks are all competing for finite resources, but since the tasks are not aware of each other, it is difficult to coordinate resource consumption effectively. In practice, however, one often encounters diminishing returns in performance gains with increasing buffer sizes, such that it is not worth the effort to search for an optimal buffer size (personal communication, Jeff Dean).
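The counter-based variant of this “block and flush” idea could look roughly like the Python sketch below. The class name, the `flush_every` parameter, and the driver function are all hypothetical; a memory-threshold variant would replace the counter test with a check on the size of the in-memory structure.

```python
from collections import Counter

class BlockingWordCountMapper:
    def __init__(self, flush_every):
        self.flush_every = flush_every  # block size n, chosen empirically

    def initialize(self):
        self.counts = Counter()
        self.seen = 0  # number of input key-value pairs processed so far

    def map(self, doc_id, text):
        self.counts.update(text.split())
        self.seen += 1
        if self.seen % self.flush_every == 0:
            yield from self._flush()

    def close(self):
        # Emit whatever remains after the last full block.
        yield from self._flush()

    def _flush(self):
        pairs = list(self.counts.items())
        self.counts.clear()
        return pairs

def run_map_task(mapper, split):
    """Hypothetical driver that collects everything the mapper emits."""
    mapper.initialize()
    emitted = []
    for doc_id, text in split:
        emitted.extend(mapper.map(doc_id, text))
    emitted.extend(mapper.close())
    return emitted

split = [(1, "a b a"), (2, "a c"), (3, "a b")]
emitted = run_map_task(BlockingWordCountMapper(flush_every=2), split)
print(emitted)
```

With a block size of two documents, this split produces five intermediate pairs: more than the three that unbounded in-mapper combining would emit, but fewer than the seven emitted with no aggregation at all.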

In MapReduce algorithms, the extent to which efficiency can be increased through local aggregation depends on the size of the intermediate key space, the distribution of keys themselves, and the number of key-value pairs that are emitted by each individual map task. Opportunities for aggregation, after all, come from having multiple values associated with the same key (whether one uses combiners or employs the in-mapper combining pattern). In the word count example, local aggregation is effective because many words are encountered multiple times within a map task. Local aggregation is also an effective technique for dealing with reduce stragglers (see Section 2.3) that result from a highly skewed (e.g., Zipfian) distribution of values associated with intermediate keys. In our word count example, we do not filter frequently occurring words: therefore, without local aggregation, the reducer that’s responsible for computing the count of ‘the’ will have a lot more work to do than the typical reducer, and therefore will likely be a straggler. With local aggregation (either combiners or in-mapper combining), we substantially reduce the number of values associated with frequently occurring terms, which alleviates the reduce straggler problem.

3.1.2 ALGORITHMIC CORRECTNESS WITH LOCAL AGGREGATION

Although use of combiners can yield dramatic reductions in algorithm running time, care must be taken in applying them. Since combiners in Hadoop are viewed as optional optimizations, the correctness of the algorithm cannot depend on computations performed by the combiner or depend on them even being run at all. In any MapReduce program, the reducer input key-value type must match the mapper output key-value type: this implies that the combiner input and output key-value types must match the mapper output key-value type (which is the same as the reducer input key-value type). In cases where the reduce computation is both commutative and associative, the reducer can also be used (unmodified) as the combiner (as is the case with the word count example). In the general case, however, combiners and reducers are not interchangeable.

Consider a simple example: we have a large dataset where input keys are strings and input values are integers, and we wish to compute the mean of all integers associated with the same key (rounded to the nearest integer). A real-world example might be a large user log from a popular website, where keys represent user ids and values represent some measure of activity such as elapsed time for a particular session; the task would correspond to computing the mean session length on a per-user basis, which would be useful for understanding user demographics. Figure 3.4 shows the pseudo-code of a simple algorithm for accomplishing this task that does not involve combiners. We use an identity mapper, which simply passes all input key-value pairs to the reducers (appropriately grouped and sorted). The reducer keeps track of the running sum and the number of integers encountered. This information is used to compute the mean once all values are processed. The mean is then emitted as the output value in the reducer (with the input string as the key).

    class Mapper
        method Map(string t, integer r)
            Emit(string t, integer r)

    class Reducer
        method Reduce(string t, integers [r_1, r_2, ...])
            sum ← 0
            cnt ← 0
            for all integer r ∈ integers [r_1, r_2, ...] do
                sum ← sum + r
                cnt ← cnt + 1
            r_avg ← sum/cnt
            Emit(string t, integer r_avg)

Figure 3.4: Pseudo-code for the basic MapReduce algorithm that computes the mean of values associated with the same key.

This algorithm will indeed work, but suffers from the same drawbacks as the basic word count algorithm in Figure 3.1: it requires shuffling all key-value pairs from mappers to reducers across the network, which is highly inefficient. Unlike in the word count example, the reducer cannot be used as a combiner in this case. Consider what would happen if we did: the combiner would compute the mean of an arbitrary subset of values associated with the same key, and the reducer would compute the mean of those partial means. As a concrete example, we know that:

Mean(1, 2, 3, 4, 5) ≠ Mean(Mean(1, 2), Mean(3, 4, 5))

The left-hand side is 3, whereas the right-hand side is Mean(1.5, 4) = 2.75. In general, the mean of means of arbitrary subsets of a set of numbers is not the same as the mean of the set of numbers. Therefore, this approach would not produce the correct result.⁵
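The arithmetic can be checked directly. The short Python snippet below uses the standard-library `statistics.mean`; the particular subsets are just the ones from the example, plus one equal-size split added for contrast.

```python
from statistics import mean

values = [1, 2, 3, 4, 5]
true_mean = mean(values)

# What a reducer reused as a combiner would compute on unequal subsets.
mean_of_means = mean([mean([1, 2]), mean([3, 4, 5])])
print(true_mean, mean_of_means)  # 3 versus 2.75

# With equal-size subsets the two happen to agree.
equal_split = mean([mean([1, 2]), mean([3, 4])])
print(equal_split, mean([1, 2, 3, 4]))  # 2.5 and 2.5
```

The disagreement arises because each partial mean is weighted equally regardless of how many values it summarizes.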

⁵ There is, however, one special case in which using reducers as combiners would produce the correct result: if each combiner computed the mean of equal-size subsets of the values. However, since such fine-grained control over the combiners is impossible in MapReduce, such a scenario is highly unlikely.

So how might we properly take advantage of combiners? An attempt is shown in Figure 3.5. The mapper remains the same, but we have added a combiner that partially aggregates results by computing the numeric components necessary to arrive at the mean. The combiner receives each string and the associated list of integer values, from which it computes the sum of those values and the number of integers encountered (i.e., the count). The sum and count are packaged into a pair, and emitted as the output of the combiner, with the same string as the key. In the reducer, pairs of partial sums and counts can be aggregated to arrive at the mean. Up until now, all keys and values in our algorithms have been primitives (strings, integers, etc.). However, there are no prohibitions in MapReduce for more complex types,⁶ and, in fact, this represents a key technique in MapReduce algorithm design that we introduced at the beginning of this chapter. We will frequently encounter complex keys and values throughout the rest of this book.

    class Mapper
        method Map(string t, integer r)
            Emit(string t, integer r)

    class Combiner
        method Combine(string t, integers [r_1, r_2, ...])
            sum ← 0
            cnt ← 0
            for all integer r ∈ integers [r_1, r_2, ...] do
                sum ← sum + r
                cnt ← cnt + 1
            Emit(string t, pair (sum, cnt))    ▷ Separate sum and count

    class Reducer
        method Reduce(string t, pairs [(s_1, c_1), (s_2, c_2) ...])
            sum ← 0
            cnt ← 0
            for all pair (s, c) ∈ pairs [(s_1, c_1), (s_2, c_2) ...] do
                sum ← sum + s
                cnt ← cnt + c
            r_avg ← sum/cnt
            Emit(string t, integer r_avg)

Figure 3.5: Pseudo-code for an incorrect first attempt at introducing combiners to compute the mean of values associated with each key. The mismatch between combiner input and output key-value types violates the MapReduce programming model.
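As the caption of Figure 3.5 notes, this first attempt is incorrect. A small Python sketch makes the failure concrete; the function names are hypothetical, and because Python is dynamically typed the mismatch surfaces only at run time, as an exception, when the combiner is skipped. For simplicity the sketch returns the unrounded mean.

```python
def combine_first_attempt(key, values):
    """Combiner of Figure 3.5: integers in, one (sum, count) pair out."""
    values = list(values)
    return key, (sum(values), len(values))

def reduce_first_attempt(key, pairs):
    """Reducer of Figure 3.5: expects (sum, count) pairs as values."""
    total = count = 0
    for s, c in pairs:
        total += s
        count += c
    return key, total / count

# Combiner applied exactly once per (hypothetical) map task: correct mean.
partials = [combine_first_attempt("u", [1, 2])[1],
            combine_first_attempt("u", [3, 4, 5])[1]]
ok = reduce_first_attempt("u", partials)

# Combiner skipped: the reducer receives raw integers and cannot unpack them.
try:
    reduce_first_attempt("u", [1, 2, 3, 4, 5])
    skipped = "no error"
except TypeError:
    skipped = "TypeError"

print(ok, skipped)
```

The result is right only when the combiner runs exactly once on every mapper output, which is a guarantee the framework does not give.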

Unfortunately, this algorithm will not work. Recall that combiners must have the same input and output key-value type, which also must be the same as the mapper output type and the reducer input type. This is clearly not the case. To understand why this restriction is necessary in the programming model, remember that combiners are optimizations that cannot change the correctness of the algorithm. So let us remove the combiner and see what happens: the output value type of the mapper is integer, so the reducer would receive a list of integers as values. But the reducer actually expects a list of pairs! The correctness of the algorithm is contingent on the combiner running on the output of the mappers, and more specifically, on the combiner being run exactly once. Recall from our previous discussion that Hadoop makes no guarantees on how many times combiners are called; it could be zero, one, or multiple times. This violates the MapReduce programming model.

⁶ In Hadoop, either custom types or types defined using a library such as Protocol Buffers, Thrift, or Avro.

    class Mapper
        method Map(string t, integer r)
            Emit(string t, pair (r, 1))

    class Combiner
        method Combine(string t, pairs [(s_1, c_1), (s_2, c_2) ...])
            sum ← 0
            cnt ← 0
            for all pair (s, c) ∈ pairs [(s_1, c_1), (s_2, c_2) ...] do
                sum ← sum + s
                cnt ← cnt + c
            Emit(string t, pair (sum, cnt))

    class Reducer
        method Reduce(string t, pairs [(s_1, c_1), (s_2, c_2) ...])
            sum ← 0
            cnt ← 0
            for all pair (s, c) ∈ pairs [(s_1, c_1), (s_2, c_2) ...] do
                sum ← sum + s
                cnt ← cnt + c
            r_avg ← sum/cnt
            Emit(string t, integer r_avg)

Figure 3.6: Pseudo-code for a MapReduce algorithm that computes the mean of values associated with each key. This algorithm correctly takes advantage of combiners.

Another stab at the algorithm is shown in Figure 3.6, and this time, the algorithm is correct.

In the mapper we emit as the value a pair consisting of the integer and one—this corresponds to a

partial count over one instance. The combiner separately aggregates the partial sums and the partial

counts (as before), and emits pairs with updated sums and counts. The reducer is similar to the

combiner, except that the mean is computed at the end. In essence, this algorithm transforms a

non-associative operation (mean of numbers) into an associative operation (element-wise sum of a

pair of numbers, with an additional division at the very end).
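The claim that the Figure 3.6 design is insensitive to how often the combiner runs can be illustrated with a short Python sketch. The function names and the helper `values_of` are hypothetical, and the sketch emits the unrounded mean rather than an integer.

```python
def map_mean(key, value):
    """Mapper of Figure 3.6: wrap each integer as a (sum, count) pair."""
    yield key, (value, 1)

def combine_mean(key, pairs):
    """Combiner: pairs in, one pair out (same type on both sides)."""
    total = count = 0
    for s, c in pairs:
        total += s
        count += c
    yield key, (total, count)

def reduce_mean(key, pairs):
    """Reducer: same aggregation, followed by a single final division."""
    total = count = 0
    for s, c in pairs:
        total += s
        count += c
    yield key, total / count

def values_of(key_value_pairs):
    """Hypothetical helper: drop the keys, keep the values."""
    return [value for _, value in key_value_pairs]

mapped = values_of(kv for r in [1, 2, 3, 4, 5] for kv in map_mean("u", r))

zero_times = list(reduce_mean("u", mapped))
once = (values_of(combine_mean("u", mapped[:2]))
        + values_of(combine_mean("u", mapped[2:])))
one_time = list(reduce_mean("u", once))
twice = values_of(combine_mean("u", once))
two_times = list(reduce_mean("u", twice))

print(zero_times, one_time, two_times)
```

All three runs give the same mean because the combiner’s output has the same shape as its input, so it can be applied any number of times, including zero.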

Let us verify the correctness of this algorithm by repeating the previous exercise: What would happen if no combiners were run? With no combiners, the mappers would send pairs (as values) directly to the reducers. There would be as many intermediate pairs as there were input key-value pairs, and each of those would consist of an integer and one. The reducer would still arrive at the correct sum and count, and hence the mean would be correct. Now add in the combiners: the algorithm would remain correct, no matter how many times they run, since the combiners merely aggregate partial sums and counts to pass along to the reducers. Note that although the output key-value type of the combiner must be the same as the input key-value type of the reducer, the reducer can emit final key-value pairs of a different type.

    class Mapper
        method Initialize
            S ← new AssociativeArray
            C ← new AssociativeArray
        method Map(string t, integer r)
            S{t} ← S{t} + r
            C{t} ← C{t} + 1
        method Close
            for all term t ∈ S do
                Emit(term t, pair (S{t}, C{t}))

Figure 3.7: Pseudo-code for a MapReduce algorithm that computes the mean of values associated with each key, illustrating the in-mapper combining design pattern. Only the mapper is shown here; the reducer is the same as in Figure 3.6.

Finally, in Figure 3.7, we present an even more efficient algorithm that exploits the in-mapper combining pattern. Inside the mapper, the partial sums and counts associated with each string are held in memory across input key-value pairs. Intermediate key-value pairs are emitted only after the entire input split has been processed; similar to before, the value is a pair consisting of the sum and count. The reducer is exactly the same as in Figure 3.6. Moving partial aggregation from the combiner directly into the mapper is subject to all the tradeoffs and caveats discussed earlier in this section, but in this case the memory footprint of the data structures for holding intermediate data is likely to be modest, making this variant algorithm an attractive option.
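For comparison with the word count version, the Figure 3.7 mapper might be sketched in Python as follows. The class and method names are hypothetical, and the sample records (user ids with session lengths) are invented for illustration.

```python
from collections import defaultdict

class InMapperMean:
    def initialize(self):
        self.sums = defaultdict(int)    # S in Figure 3.7
        self.counts = defaultdict(int)  # C in Figure 3.7

    def map(self, key, value):
        # Hold partial sums and counts in memory; emit nothing yet.
        self.sums[key] += value
        self.counts[key] += 1

    def close(self):
        # One (sum, count) pair per key seen in the input split.
        for key in self.sums:
            yield key, (self.sums[key], self.counts[key])

mapper = InMapperMean()
mapper.initialize()
for user, session_length in [("u1", 10), ("u2", 4), ("u1", 20)]:
    mapper.map(user, session_length)
emitted = list(mapper.close())
print(emitted)
```

Memory use here grows with the number of distinct keys in the split rather than with the number of records.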

3.2 PAIRS AND STRIPES

One common approach for synchronization in MapReduce is to construct complex keys and values in such a way that data necessary for a computation are naturally brought together by the execution framework. We first touched on this technique in the previous section, in the context of “packaging” partial sums and counts in a complex value (i.e., pair) that is passed from mapper to combiner to reducer. Building on previously published work [54; 94], this section introduces two common design patterns we have dubbed “pairs” and “stripes” that exemplify this strategy.


As a running example, we focus on the problem of building word co-occurrence matrices from large corpora, a common task in corpus linguistics and statistical natural language processing. Formally, the co-occurrence matrix of a corpus is a square $n \times n$ matrix where $n$ is the number of unique words in the corpus (i.e., the vocabulary size). A cell $m_{ij}$ contains the number of times word $w_i$ co-occurs with word $w_j$ within a specific context: a natural unit such as a sentence, paragraph, or a document, or a certain window of $m$ words (where $m$ is an application-dependent parameter). Note that the upper and lower triangles of the matrix are identical since co-occurrence is a symmetric relation, though in the general case relations between words need not be symmetric. For example, a co-occurrence matrix $M$ where $m_{ij}$ is the count of how many times word $w_i$ was immediately succeeded by word $w_j$ would usually not be symmetric.

This task is quite common in text processing and provides the starting point to many other algorithms, e.g., for computing statistics such as pointwise mutual information [38], for unsupervised sense clustering [136], and more generally, a large body of work in lexical semantics based on distributional profiles of words, dating back to Firth [55] and Harris [69] in the 1950s and 1960s. The task also has applications in information retrieval (e.g., automatic thesaurus construction [137] and stemming [157]), and other related fields such as text mining. More importantly, this problem represents a specific instance of the task of estimating distributions of discrete joint events from a large number of observations, a very common task in statistical natural language processing for which there are nice MapReduce solutions. Indeed, concepts presented here are also used in Chapter 6 when we discuss expectation-maximization algorithms.

Beyond text processing, problems in many application domains share similar characteristics. For example, a large retailer might analyze point-of-sale transaction records to identify correlated product purchases (e.g., customers who buy this tend to also buy that), which would assist in inventory management and product placement on store shelves. Similarly, an intelligence analyst might wish to identify associations between re-occurring financial transactions that are otherwise unrelated, which might provide a clue in thwarting terrorist activity. The algorithms discussed in this section could be adapted to tackle these related problems.

It is obvious that the space requirement for the word co-occurrence problem is $O(n^2)$, where $n$ is the size of the vocabulary, which for real-world English corpora can be hundreds of thousands of words, or even billions of words in web-scale collections.⁷ The computation of the word co-occurrence matrix is quite simple if the entire matrix fits into memory; however, in the case where the matrix is too big to fit in memory, a naïve implementation on a single machine can be very slow as memory is paged to disk. Although compression techniques can increase the size of corpora for which word co-occurrence matrices can be constructed on a single machine, it is clear that there are inherent scalability limitations. We describe two MapReduce algorithms for this task that can scale to large corpora.

⁷ The size of the vocabulary depends on the definition of a “word” and techniques (if any) for corpus pre-processing. One common strategy is to replace all rare words (below a certain frequency) with a “special” token such as <UNK> (which stands for “unknown”) to model out-of-vocabulary words. Another technique involves replacing numeric digits with #, such that 1.32 and 1.19 both map to the same token (#.##).


Pseudo-code for the first algorithm, dubbed the “pairs” approach, is shown in Figure 3.8. As usual, document ids and the corresponding contents make up the input key-value pairs. The mapper processes each input document and emits intermediate key-value pairs with each co-occurring word pair as the key and the integer one (i.e., the count) as the value. This is straightforwardly accomplished by two nested loops: the outer loop iterates over all words (the left element in the pair), and the inner loop iterates over all neighbors of the first word (the right element in the pair). The neighbors of a word can either be defined in terms of a sliding window or some other contextual unit such as a sentence. The MapReduce execution framework guarantees that all values associated with the same key are brought together in the reducer. Thus, in this case the reducer simply sums up all the values associated with the same co-occurring word pair to arrive at the absolute count of the joint event in the corpus, which is then emitted as the final key-value pair. Each pair corresponds to a cell in the word co-occurrence matrix. This algorithm illustrates the use of complex keys in order to coordinate distributed computations.
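A single-process Python sketch of the pairs approach is shown below. It assumes a sliding window of one word on each side as the definition of Neighbors, and the `run_pairs` function is a hypothetical stand-in for the grouping that the execution framework performs.

```python
from collections import defaultdict

def neighbors(terms, i, window):
    """Terms within `window` positions of index i (an assumed definition
    of Neighbors; a sentence-based definition would work equally well)."""
    lo, hi = max(0, i - window), min(len(terms), i + window + 1)
    return [terms[j] for j in range(lo, hi) if j != i]

def map_pairs(doc_id, text, window=1):
    """Emit ((w, u), 1) for every co-occurring word pair."""
    terms = text.split()
    for i, w in enumerate(terms):
        for u in neighbors(terms, i, window):
            yield (w, u), 1

def reduce_pairs(pair, counts):
    """Sum the counts for one cell of the co-occurrence matrix."""
    yield pair, sum(counts)

def run_pairs(docs, window=1):
    """Hypothetical driver: group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in docs:
        for pair, one in map_pairs(doc_id, text, window):
            grouped[pair].append(one)
    return dict(kv for pair, counts in grouped.items()
                for kv in reduce_pairs(pair, counts))

matrix_cells = run_pairs([(1, "a b a")])
print(matrix_cells)
```

Each output key is a (row, column) pair, so every final key-value pair is one cell of the matrix.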

An alternative approach, dubbed the “stripes” approach, is presented in Figure 3.9. Like the pairs approach, co-occurring word pairs are generated by two nested loops. However, the major difference is that instead of emitting intermediate key-value pairs for each co-occurring word pair, co-occurrence information is first stored in an associative array, denoted H. The mapper emits key-value pairs with words as keys and corresponding associative arrays as values, where each associative array encodes the co-occurrence counts of the neighbors of a particular word (i.e., its context). The MapReduce execution framework guarantees that all associative arrays with the same key will be brought together in the reduce phase of processing. The reducer performs an element-wise sum of all associative arrays with the same key, accumulating counts that correspond to the same cell in the co-occurrence matrix. The final associative array is emitted with the same word as the key. In contrast to the pairs approach, each final key-value pair encodes a row in the co-occurrence matrix.
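The stripes approach can be sketched in the same style. As before, the one-word sliding window and the `run_stripes` driver are assumptions made for illustration; `collections.Counter` serves as the associative array, and its `update` method performs the element-wise sum.

```python
from collections import Counter, defaultdict

def map_stripes(doc_id, text, window=1):
    """Emit (w, stripe) where the stripe tallies the neighbors of w."""
    terms = text.split()
    for i, w in enumerate(terms):
        stripe = Counter()
        lo, hi = max(0, i - window), min(len(terms), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                stripe[terms[j]] += 1
        yield w, stripe

def reduce_stripes(term, stripes):
    """Element-wise sum of all stripes that share the same key."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    yield term, total

def run_stripes(docs, window=1):
    """Hypothetical driver: group stripes by word, then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in docs:
        for w, stripe in map_stripes(doc_id, text, window):
            grouped[w].append(stripe)
    return {w: dict(total) for term, stripes in grouped.items()
            for w, total in reduce_stripes(term, stripes)}

matrix_rows = run_stripes([(1, "a b a")])
print(matrix_rows)
```

Here each output value is a whole row of the matrix, keyed by the row’s word.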

It is immediately obvious that the pairs algorithm generates an immense number of key-value pairs compared to the stripes approach. The stripes representation is much more compact, since with pairs the left element is repeated for every co-occurring word pair. The stripes approach also generates fewer and shorter intermediate keys, and therefore the execution framework has less sorting to perform. However, values in the stripes approach are more complex, and come with more serialization and deserialization overhead than with the pairs approach.

Both algorithms can beneﬁt from the use of combiners, since the respective operations in

their reducers (addition and element-wise sum of associative arrays) are both commutative and

associative. However, combiners with the stripes approach have more opportunities to perform local

aggregation because the key space is the vocabulary—associative arrays can be merged whenever a

word is encountered multiple times by a mapper. In contrast, the key space in the pairs approach

is the cross of the vocabulary with itself, which is far larger—counts can be aggregated only when

the same co-occurring word pair is observed multiple times by an individual mapper (which is less

likely than observing multiple occurrences of a word, as in the stripes case).

50 3. MAPREDUCEALGORITHMDESIGN

class Mapper
    method Map(docid a, doc d)
        for all term w ∈ doc d do
            for all term u ∈ Neighbors(w) do
                Emit(pair (w, u), count 1)    ▷ Emit count for each co-occurrence

class Reducer
    method Reduce(pair p, counts [c_1, c_2, . . .])
        s ← 0
        for all count c ∈ counts [c_1, c_2, . . .] do
            s ← s + c    ▷ Sum co-occurrence counts
        Emit(pair p, count s)

Figure 3.8: Pseudo-code for the “pairs” approach for computing word co-occurrence matrices from large corpora.
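The pairs pseudo-code can be exercised on a single machine with a short Python sketch. This is only an illustration under stated assumptions: `neighbors` uses a symmetric window over whitespace-separated tokens, and `run_job` is a hypothetical stand-in for the shuffle and sort phase, not part of Hadoop.

```python
from collections import defaultdict

def neighbors(tokens, i, window=2):
    """Terms within `window` positions of tokens[i], excluding position i (assumed definition)."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def pairs_map(docid, doc, window=2):
    """Mapper: emit ((w, u), 1) for every co-occurrence in one document."""
    tokens = doc.split()
    for i, w in enumerate(tokens):
        for u in neighbors(tokens, i, window):
            yield (w, u), 1

def pairs_reduce(pair, counts):
    """Reducer: sum the partial counts that arrived for one word pair."""
    yield pair, sum(counts)

def run_job(docs, mapper, reducer):
    """Hypothetical local stand-in for the framework: group by key, sort keys, reduce."""
    groups = defaultdict(list)
    for docid, doc in docs.items():
        for key, value in mapper(docid, doc):
            groups[key].append(value)
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output

docs = {1: "a b a", 2: "b a c"}
pair_counts = run_job(docs, pairs_map, pairs_reduce)  # one entry per non-zero matrix cell
```

Each entry of `pair_counts` corresponds to one cell of the co-occurrence matrix, which is why the number of intermediate keys grows so quickly with the vocabulary.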

class Mapper
    method Map(docid a, doc d)
        for all term w ∈ doc d do
            H ← new AssociativeArray
            for all term u ∈ Neighbors(w) do
                H{u} ← H{u} + 1    ▷ Tally words co-occurring with w
            Emit(Term w, Stripe H)

class Reducer
    method Reduce(term w, stripes [H_1, H_2, H_3, . . .])
        H_f ← new AssociativeArray
        for all stripe H ∈ stripes [H_1, H_2, H_3, . . .] do
            Sum(H_f, H)    ▷ Element-wise sum
        Emit(term w, stripe H_f)

Figure 3.9: Pseudo-code for the “stripes” approach for computing word co-occurrence matrices from large corpora.
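A comparable Python sketch of the stripes pseudo-code follows. It assumes whitespace tokenization and a symmetric window, uses `collections.Counter` as the associative array, and simulates the shuffle locally with a hypothetical `run_stripes` helper.

```python
from collections import Counter, defaultdict

def neighbors(tokens, i, window=2):
    """Terms within `window` positions of tokens[i], excluding position i (assumed definition)."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def stripes_map(docid, doc, window=2):
    """Mapper: emit (w, stripe) where the stripe tallies the neighbors of one occurrence of w."""
    tokens = doc.split()
    for i, w in enumerate(tokens):
        yield w, Counter(neighbors(tokens, i, window))

def stripes_reduce(term, stripes):
    """Reducer: element-wise sum of all stripes that share the same term."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)  # Counter.update adds counts key by key
    yield term, total

def run_stripes(docs):
    """Hypothetical local stand-in for the shuffle and sort phase."""
    groups = defaultdict(list)
    for docid, doc in docs.items():
        for term, stripe in stripes_map(docid, doc):
            groups[term].append(stripe)
    rows = {}
    for term in sorted(groups):
        for out_term, row in stripes_reduce(term, groups[term]):
            rows[out_term] = dict(row)
    return rows

docs = {1: "a b a", 2: "b a c"}
rows = run_stripes(docs)  # one entry per row of the co-occurrence matrix
```

Here the output has one key per distinct word rather than one key per word pair, at the cost of larger values.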


For both algorithms, the in-mapper combining optimization discussed in the previous section

can also be applied; the modiﬁcation is sufﬁciently straightforward that we leave the implementation

as an exercise for the reader. However, the above caveats remain: there will be far fewer opportunities

for partial aggregation in the pairs approach due to the sparsity of the intermediate key space. The

sparsity of the key space also limits the effectiveness of in-memory combining, since the mapper

may run out of memory to store partial counts before all documents are processed, necessitating

some mechanism to periodically emit key-value pairs (which further limits opportunities to perform

partial aggregation). Similarly, for the stripes approach, memory management will also be more

complex than in the simple word count example. For common terms, the associative array may grow

to be quite large, necessitating some mechanism to periodically ﬂush in-memory structures.
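One possible shape for in-mapper combining with periodic flushing in the pairs approach is sketched below. The class name, the `emit` callback, and the `max_entries` threshold are all hypothetical choices made for illustration; a real implementation would pick a threshold based on available memory.

```python
from collections import Counter

class PairsMapperWithInMapperCombining:
    """Pairs mapper that aggregates counts in memory and flushes when the buffer grows."""

    def __init__(self, emit, max_entries=100_000, window=2):
        self.emit = emit                # callback taking (key, value); assumed interface
        self.max_entries = max_entries  # hypothetical memory bound, in distinct pairs
        self.window = window
        self.counts = Counter()

    def map(self, docid, doc):
        tokens = doc.split()
        for i, w in enumerate(tokens):
            lo = max(0, i - self.window)
            hi = min(len(tokens), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    self.counts[(w, tokens[j])] += 1
        # Flush between documents once the buffer holds too many distinct pairs.
        if len(self.counts) >= self.max_entries:
            self.flush()

    def flush(self):
        for pair, count in self.counts.items():
            self.emit(pair, count)
        self.counts.clear()

    def close(self):
        """Called once after the last document to emit whatever is still buffered."""
        self.flush()

emitted = []
mapper = PairsMapperWithInMapperCombining(lambda k, v: emitted.append((k, v)), max_entries=3)
mapper.map(1, "a b a")
mapper.map(2, "b a c")
mapper.close()

totals = Counter()
for pair, count in emitted:
    totals[pair] += count
```

With the deliberately tiny threshold used here, the buffer is flushed after every document, so the same pair can be emitted more than once. This is the loss of aggregation opportunity described above: the final totals are still correct after the reducer sums them, but more intermediate data is produced.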

It is important to consider potential scalability bottlenecks of either algorithm. The stripes approach makes the assumption that, at any point in time, each associative array is small enough to fit into memory; otherwise, memory paging will significantly impact performance. The size of the associative array is bounded by the vocabulary size, which is itself unbounded with respect to corpus size (recall the previous discussion of Heaps’ Law). Therefore, as the sizes of corpora increase, this will become an increasingly pressing issue: perhaps not for gigabyte-sized corpora, but certainly for terabyte-sized and petabyte-sized corpora that will be commonplace tomorrow. The pairs approach, on the other hand, does not suffer from this limitation, since it does not need to hold intermediate data in memory.

Given this discussion, which approach is faster? Here, we present previously published results [94] that empirically answered this question. We implemented both algorithms in Hadoop and applied them to a corpus of 2.27 million documents from the Associated Press Worldstream (APW) totaling 5.7 GB.⁸ Prior to working with Hadoop, the corpus was first preprocessed as follows: All XML markup was removed, followed by tokenization and stopword removal using standard tools from the Lucene search engine. All tokens were then replaced with unique integers for a more efficient encoding. Figure 3.10 compares the running time of the pairs and stripes approaches on different fractions of the corpus, with a co-occurrence window size of two. These experiments were performed on a Hadoop cluster with 19 slave nodes, each with 2 single-core processors and 2 disks.

Results demonstrate that the stripes approach is much faster than the pairs approach: 666 seconds (∼11 minutes) compared to 3758 seconds (∼63 minutes) for the entire corpus (an improvement by a factor of roughly 5.6). The mappers in the pairs approach generated 2.6 billion intermediate key-value pairs totaling 31.2 GB. After the combiners, this was reduced to 1.1 billion key-value pairs, which quantifies the amount of intermediate data transferred across the network. In the end, the reducers emitted a total of 142 million final key-value pairs (the number of non-zero cells in the co-occurrence matrix). On the other hand, the mappers in the stripes approach generated 653 million intermediate key-value pairs totaling 48.1 GB. After the combiners, only 28.8 million key-value pairs remained. The reducers emitted a total of 1.69 million final key-value pairs (the number of rows in the co-occurrence matrix). As expected, the stripes approach provided more opportunities for combiners to aggregate intermediate results, thus greatly reducing network traffic in the shuffle and sort phase. Figure 3.10 also shows that both algorithms exhibit highly desirable scaling characteristics: running time is linear in the amount of input data. This is confirmed by a linear regression applied to the running time data, which yields an $R^2$ value close to one.

⁸ This was a subset of the English Gigaword corpus (version 3) distributed by the Linguistic Data Consortium (LDC catalog number LDC2007T07).

An additional series of experiments explored the scalability of the stripes approach along

another dimension: the size of the cluster. These experiments were made possible by Amazon’s EC2

service, which allows users to rapidly provision clusters of varying sizes for limited durations (for

more information, refer back to our discussion of utility computing in Section 1.1). Virtualized

computational units in EC2 are called instances, and the user is charged only for the instance-hours

consumed. Figure 3.11 (left) shows the running time of the stripes algorithm (on the same corpus,

with same setup as before), on varying cluster sizes, from 20 slave “small” instances all the way up to

80 slave “small” instances (along the x-axis). Running times are shown with solid squares. Figure 3.11

(right) recasts the same results to illustrate scaling characteristics. The circles plot the relative size

and speedup of the EC2 experiments, with respect to the 20-instance cluster. These results show highly desirable linear scaling characteristics (i.e., doubling the cluster size makes the job twice as fast). This is confirmed by a linear regression with an $R^2$ value close to one.

Viewed abstractly, the pairs and stripes algorithms represent two different approaches to counting co-occurring events from a large number of observations. This general description captures the gist of many algorithms in fields as diverse as text processing, data mining, and bioinformatics.

For this reason, these two design patterns are broadly useful and frequently observed in a variety of

applications.

To conclude, it is worth noting that the pairs and stripes approaches represent endpoints along a continuum of possibilities. The pairs approach individually records each co-occurring event, while the stripes approach records all co-occurring events with respect to a conditioning event. A middle ground might be to record a subset of the co-occurring events with respect to a conditioning event. We might divide up the entire vocabulary into b buckets (e.g., via hashing), so that words co-occurring with $w_i$ would be divided into b smaller “sub-stripes”, associated with b separate keys, $(w_i, 1), (w_i, 2), \ldots, (w_i, b)$. This would be a reasonable solution to the memory limitations of the stripes approach, since each of the sub-stripes would be smaller. In the case of b = |V|, where |V| is the vocabulary size, this is equivalent to the pairs approach. In the case of b = 1, this is equivalent to the standard stripes approach.
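The sub-stripes idea can be sketched as follows. The bucketing function (`zlib.crc32` modulo b) and the helper names are assumptions made for illustration. Note that with a hash function, b = |V| only reproduces the pairs approach if every word lands in its own bucket.

```python
import zlib
from collections import Counter, defaultdict

def bucket_of(word, b):
    """Assign a word to one of b buckets, numbered 1..b (hypothetical hashing scheme)."""
    return zlib.crc32(word.encode("utf-8")) % b + 1

def substripes_map(docid, doc, b, window=2):
    """Mapper: split the stripe for each occurrence of w into at most b sub-stripes."""
    tokens = doc.split()
    for i, w in enumerate(tokens):
        partial = defaultdict(Counter)
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                u = tokens[j]
                partial[bucket_of(u, b)][u] += 1
        for bucket, stripe in partial.items():
            yield (w, bucket), stripe

def substripes_reduce(key, stripes):
    """Reducer: element-wise sum, exactly as in the stripes approach."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    return key, total

def cooccurrence_rows(docs, b):
    """Local simulation; sub-stripes are stitched back together for inspection only."""
    groups = defaultdict(list)
    for docid, doc in docs.items():
        for key, stripe in substripes_map(docid, doc, b):
            groups[key].append(stripe)
    rows = defaultdict(Counter)
    for key in sorted(groups):
        (w, _bucket), total = substripes_reduce(key, groups[key])
        rows[w].update(total)
    return {w: dict(c) for w, c in rows.items()}

docs = {1: "a b a", 2: "b a c"}
rows_b1 = cooccurrence_rows(docs, b=1)  # standard stripes
rows_b3 = cooccurrence_rows(docs, b=3)  # same counts, spread over smaller sub-stripes
```

Whatever value of b is chosen, the counts are the same; only the size of each associative array and the number of intermediate keys change.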

Figure 3.10: Running time of the “pairs” and “stripes” algorithms for computing word co-occurrence matrices on different fractions of the APW corpus. These experiments were performed on a Hadoop cluster with 19 slaves, each with 2 single-core processors and 2 disks. [Plot: running time in seconds against percentage of the APW corpus, one series per approach, with linear fits of $R^2 = 0.992$ and $R^2 = 0.999$.]

Figure 3.11: Running time of the stripes algorithm on the APW corpus with Hadoop clusters of different sizes from EC2 (left). Scaling characteristics (relative speedup) in terms of increasing Hadoop cluster size (right). [Left plot: running time in seconds against size of EC2 cluster (number of slave instances). Right plot: relative speedup against relative size of EC2 cluster, with a linear fit of $R^2 = 0.997$.]

3.3 COMPUTING RELATIVE FREQUENCIES

Let us build on the pairs and stripes algorithms presented in the previous section and continue with our running example of constructing the word co-occurrence matrix M for a large corpus. Recall that in this large square $n \times n$ matrix, where $n = |V|$ (the vocabulary size), cell $m_{ij}$ contains the number of times word $w_i$ co-occurs with word $w_j$ within a specific context. The drawback of absolute counts is that they do not take into account the fact that some words appear more frequently than others. Word $w_i$ may co-occur frequently with $w_j$ simply because one of the words is very common. A simple remedy is to convert absolute counts into relative frequencies, $f(w_j \mid w_i)$. That is, what proportion of the time does $w_j$ appear in the context of $w_i$? This can be computed using the following equation:

$$f(w_j \mid w_i) = \frac{N(w_i, w_j)}{\sum_{w'} N(w_i, w')} \qquad (3.1)$$

Here, $N(\cdot, \cdot)$ indicates the number of times a particular co-occurring word pair is observed in the corpus. We need the count of the joint event (word co-occurrence), divided by what is known as the marginal (the sum of the counts of the conditioning variable co-occurring with anything else).

Computing relative frequencies with the stripes approach is straightforward. In the reducer, counts of all words that co-occur with the conditioning variable ($w_i$ in the above example) are available in the associative array. Therefore, it suffices to sum all those counts to arrive at the marginal (i.e., $\sum_{w'} N(w_i, w')$), and then divide all the joint counts by the marginal to arrive at the relative frequency for all words. This implementation requires minimal modification to the original stripes algorithm in Figure 3.9, and illustrates the use of complex data structures to coordinate distributed computations in MapReduce. Through appropriate structuring of keys and values, one can use the MapReduce execution framework to bring together all the pieces of data required to perform a computation. Note that, as before, this algorithm also assumes that each associative array fits into memory.
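As a rough illustration of this modification, the stripes reducer below sums the incoming stripes and then normalizes by the marginal. The function name and the use of `fractions.Fraction` (to keep the arithmetic exact) are choices made for this sketch only.

```python
from collections import Counter
from fractions import Fraction

def stripes_relative_frequency_reduce(term, stripes):
    """Sum the stripes for `term`, then divide each joint count by the marginal."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)           # element-wise sum, as in Figure 3.9
    marginal = sum(total.values())     # sum over w' of N(term, w')
    frequencies = {u: Fraction(count, marginal) for u, count in total.items()}
    return term, frequencies

# Two partial stripes for the same conditioning word, e.g. from different mappers.
term, freqs = stripes_relative_frequency_reduce(
    "dog", [{"aardvark": 2, "zebra": 1}, {"aardvark": 1}]
)
```

The whole stripe must be in memory before the division can happen, which is the same assumption the original stripes algorithm makes.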

How might one compute relative frequencies with the pairs approach? In the pairs approach, the reducer receives $(w_i, w_j)$ as the key and the count as the value. From this alone it is not possible to compute $f(w_j \mid w_i)$ since we do not have the marginal. Fortunately, as in the mapper, the reducer can preserve state across multiple keys. Inside the reducer, we can buffer in memory all the words that co-occur with $w_i$ and their counts, in essence building the associative array in the stripes approach. To make this work, we must define the sort order of the pair so that keys are first sorted by the left word, and then by the right word. Given this ordering, we can easily detect if all pairs associated with the word we are conditioning on ($w_i$) have been encountered. At that point we can go back through the in-memory buffer, compute the relative frequencies, and then emit those results in the final key-value pairs.

There is one more modiﬁcation necessary to make this algorithm work. We must ensure that

all pairs with the same left word are sent to the same reducer. This, unfortunately, does not happen

automatically: recall that the default partitioner is based on the hash value of the intermediate key,

modulo the number of reducers. For a complex key, the raw byte representation is used to compute

the hash value. As a result, there is no guarantee that, for example, (dog, aardvark) and (dog, zebra) are

assigned to the same reducer. To produce the desired behavior, we must deﬁne a custom partitioner

that only pays attention to the left word. That is, the partitioner should partition based on the hash

of the left word only.
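The difference between the two partitioning strategies can be sketched in Python. Both functions are hypothetical stand-ins: `zlib.crc32` replaces whatever hash the framework uses, and `repr(key)` stands in for the raw byte representation of a composite key. In Hadoop, the equivalent logic would live in a custom `Partitioner` subclass.

```python
import zlib

def whole_key_partition(key, num_reducers):
    """Default-style behavior: hash the entire composite key (both words)."""
    raw = repr(key).encode("utf-8")  # stand-in for the serialized key bytes
    return zlib.crc32(raw) % num_reducers

def left_word_partition(key, num_reducers):
    """Custom behavior: hash only the left word, ignoring the right word."""
    left, _right = key
    return zlib.crc32(left.encode("utf-8")) % num_reducers

num_reducers = 10
dog_partitions = {
    left_word_partition(("dog", right), num_reducers)
    for right in ["aardvark", "aardwolf", "zebra"]
}
```

Because `left_word_partition` never looks at the right word, every pair whose left word is dog is assigned to the same reducer, so `dog_partitions` contains a single value. No such guarantee holds for `whole_key_partition`.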

This algorithm will indeed work, but it suffers from the same drawback as the stripes approach: as the size of the corpus grows, so does the vocabulary size, and at some point there will not be sufficient memory to store all co-occurring words and their counts for the word we are conditioning on. For computing the co-occurrence matrix, the advantage of the pairs approach is that it doesn’t suffer from any memory bottlenecks. Is there a way to modify the basic pairs approach so that this advantage is retained?

key                values                 action
(dog, ∗)           [6327, 8514, . . .]    compute marginal: $\sum_{w'} N(\text{dog}, w') = 42908$
(dog, aardvark)    [2, 1]                 f(aardvark|dog) = 3/42908
(dog, aardwolf)    [1]                    f(aardwolf|dog) = 1/42908
. . .
(dog, zebra)       [2, 1, 1, 1]           f(zebra|dog) = 5/42908
(doge, ∗)          [682, . . .]           compute marginal: $\sum_{w'} N(\text{doge}, w') = 1267$
. . .

Figure 3.12: Example of the sequence of key-value pairs presented to the reducer in the pairs algorithm for computing relative frequencies. This illustrates the application of the order inversion design pattern.

As it turns out, such an algorithm is indeed possible, although it requires the coordination of

several mechanisms in MapReduce. The insight lies in properly sequencing data presented to the

reducer. If it were possible to somehow compute (or otherwise obtain access to) the marginal in the

reducer before processing the joint counts, the reducer could simply divide the joint counts by the

marginal to compute the relative frequencies. The notion of “before” and “after” can be captured in

the ordering of key-value pairs, which can be explicitly controlled by the programmer. That is, the

programmer can deﬁne the sort order of keys so that data needed earlier is presented to the reducer

before data that is needed later. However, we still need to compute the marginal counts. Recall that

in the basic pairs algorithm, each mapper emits a key-value pair with the co-occurring word pair

as the key. To compute relative frequencies, we modify the mapper so that it additionally emits a “special” key of the form $(w_i, *)$, with a value of one, that represents the contribution of the word

pair to the marginal. Through use of combiners, these partial marginal counts will be aggregated

before being sent to the reducers. Alternatively, the in-mapper combining pattern can be used to

even more efﬁciently aggregate marginal counts.

In the reducer, we must make sure that the special key-value pairs representing the partial

marginal contributions are processed before the normal key-value pairs representing the joint counts.

This is accomplished by defining the sort order of the keys so that pairs with the special symbol of the form $(w_i, *)$ are ordered before any other key-value pairs where the left word is $w_i$. In addition,

as before we must also properly deﬁne the partitioner to pay attention to only the left word in each

pair. With the data properly sequenced, the reducer can directly compute the relative frequencies.

A concrete example is shown in Figure 3.12, which lists the sequence of key-value pairs that a

reducer might encounter. First, the reducer is presented with the special key (dog, ∗) and a number

of values, each of which represents a partial marginal contribution from the map phase (assume here

either combiners or in-mapper combining, so the values represent partially aggregated counts). The

reducer accumulates these counts to arrive at the marginal, $\sum_{w'} N(\text{dog}, w')$. The reducer holds on to this value as it processes subsequent keys. After (dog, ∗), the reducer will encounter a series of

keys representing joint counts; let’s say the ﬁrst of these is the key (dog, aardvark). Associated with

this key will be a list of values representing partial joint counts from the map phase (two separate

values in this case). Summing these counts will yield the ﬁnal joint count, i.e., the number of times

dog and aardvark co-occur in the entire collection. At this point, since the reducer already knows the

marginal, simple arithmetic sufﬁces to compute the relative frequency. All subsequent joint counts

are processed in exactly the same manner. When the reducer encounters the next special key-value

pair (doge, ∗), the reducer resets its internal state and starts to accumulate the marginal all over

again. Observe that the memory requirement for this algorithm is minimal, since only the marginal

(an integer) needs to be stored. No buffering of individual co-occurring word counts is necessary,

and therefore we have eliminated the scalability bottleneck of the previous algorithm.
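The whole mechanism can be simulated on one machine in a few lines of Python. This is a sketch under several assumptions: a single reducer (so the partitioner is omitted), whitespace tokenization with a symmetric window, a `"*"` marker that never occurs as a real token, and hypothetical helper names throughout.

```python
from collections import defaultdict
from fractions import Fraction

STAR = "*"  # special symbol; assumed never to appear as a token

def order_inversion_map(docid, doc, window=2):
    """Mapper: emit the joint count and its contribution to the marginal."""
    tokens = doc.split()
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (w, tokens[j]), 1
                yield (w, STAR), 1

def sort_key(key):
    """Sort by left word; within a left word, (w, *) comes before every (w, u)."""
    left, right = key
    return (left, right != STAR, right)

class RelativeFrequencyReducer:
    """Keeps only the current marginal as state across keys."""

    def __init__(self):
        self.marginal = None

    def reduce(self, key, counts):
        _left, right = key
        total = sum(counts)
        if right == STAR:
            self.marginal = total  # reset state for the new conditioning word
            return []
        return [(key, Fraction(total, self.marginal))]

def run_single_reducer(docs):
    """Hypothetical local stand-in for shuffle, sort, and one reducer."""
    groups = defaultdict(list)
    for docid, doc in docs.items():
        for key, value in order_inversion_map(docid, doc):
            groups[key].append(value)
    reducer = RelativeFrequencyReducer()
    output = {}
    for key in sorted(groups, key=sort_key):
        for out_key, freq in reducer.reduce(key, groups[key]):
            output[out_key] = freq
    return output

docs = {1: "a b a", 2: "b a c"}
freqs = run_single_reducer(docs)
```

The reducer never stores more than one integer between calls, which is the point of the pattern. With several reducers, a partitioner on the left word would additionally be needed so that each (w, ∗) key reaches the same reducer as the joint counts for w.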

This design pattern, which we call “order inversion”, occurs surprisingly often and across

applications in many domains. It is so named because through proper coordination, we can access

the result of a computation in the reducer (for example, an aggregate statistic) before processing the

data needed for that computation. The key insight is to convert the sequencing of computations

into a sorting problem. In most cases, an algorithm requires data in some ﬁxed order: by controlling

how keys are sorted and how the key space is partitioned, we can present data to the reducer in

the order necessary to perform the proper computations. This greatly cuts down on the amount of

partial results that the reducer needs to hold in memory.

To summarize, the speciﬁc application of the order inversion design pattern for computing

relative frequencies requires the following:

• Emitting a special key-value pair for each co-occurring word pair in the mapper to capture its

contribution to the marginal.

• Controlling the sort order of the intermediate key so that the key-value pairs representing the

marginal contributions are processed by the reducer before any of the pairs representing the

joint word co-occurrence counts.

• Deﬁning a custom partitioner to ensure that all pairs with the same left word are shufﬂed to

the same reducer.

• Preserving state across multiple keys in the reducer to ﬁrst compute the marginal based on

the special key-value pairs and then dividing the joint counts by the marginals to arrive at the

relative frequencies.

As we will see in Chapter 4, this design pattern is also used in inverted index construction to properly

set compression parameters for postings lists.


3.4 SECONDARY SORTING

MapReduce sorts intermediate key-value pairs by the keys during the shufﬂe and sort phase, which is

very convenient if computations inside the reducer rely on sort order (e.g., the order inversion design

pattern described in the previous section). However, what if in addition to sorting by key, we also need

to sort by value? Google’s MapReduce implementation provides built-in functionality for (optional)

secondary sorting, which guarantees that values arrive in sorted order. Hadoop, unfortunately, does

not have this capability built in.

Consider the example of sensor data from a scientific experiment: there are m sensors each taking readings on a continuous basis, where m is potentially a large number. A dump of the sensor data might look something like the following, where $r_x$ after each timestamp represents the actual sensor reading (unimportant for this discussion, but may be a series of values, one or more complex records, or even raw bytes of images).

$(t_1, m_1, r_{80521})$
$(t_1, m_2, r_{14209})$
$(t_1, m_3, r_{76042})$
...
$(t_2, m_1, r_{21823})$
$(t_2, m_2, r_{66508})$
$(t_2, m_3, r_{98347})$

Suppose we wish to reconstruct the activity at each individual sensor over time. A MapReduce

program to accomplish this might map over the raw data and emit the sensor id as the intermediate

key, with the rest of each record as the value:

$m_1 \rightarrow (t_1, r_{80521})$

This would bring all readings from the same sensor together in the reducer. However, since Map-

Reduce makes no guarantees about the ordering of values associated with the same key, the sensor

readings will not likely be in temporal order. The most obvious solution is to buffer all the readings

in memory and then sort by timestamp before additional processing. However, it should be apparent

by now that any in-memory buffering of data introduces a potential scalability bottleneck. What if

we are working with a high frequency sensor or sensor readings over a long period of time? What

if the sensor readings themselves are large complex objects? This approach may not scale in these

cases—the reducer would run out of memory trying to buffer all values associated with the same

key.

This is a common problem, since in many applications we wish to ﬁrst group together data one

way (e.g., by sensor id), and then sort within the groupings another way (e.g., by time). Fortunately,

there is a general purpose solution, which we call the “value-to-key conversion” design pattern. The

basic idea is to move part of the value into the intermediate key to form a composite key, and let the

MapReduce execution framework handle the sorting. In the above example, instead of emitting the

sensor id as the key, we would emit the sensor id and the timestamp as a composite key:

$(m_1, t_1) \rightarrow (r_{80521})$

The sensor reading itself now occupies the value. We must deﬁne the intermediate key sort order to

ﬁrst sort by the sensor id (the left element in the pair) and then by the timestamp (the right element

in the pair). We must also implement a custom partitioner so that all pairs associated with the same

sensor are shufﬂed to the same reducer.

Properly orchestrated, the key-value pairs will be presented to the reducer in the correct sorted

order:

$(m_1, t_1) \rightarrow [(r_{80521})]$
$(m_1, t_2) \rightarrow [(r_{21823})]$
$(m_1, t_3) \rightarrow [(r_{146925})]$
. . .

However, note that sensor readings are now split across multiple keys. The reducer will need to preserve state and keep track of when readings associated with the current sensor end and the next sensor begin.⁹
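The value-to-key conversion for the sensor example might look as follows in Python. The record layout, the numeric readings, and the per-sensor "change since previous reading" computation are invented for illustration; `sorted` stands in for the framework's sort, and `zlib.crc32` for its hash function.

```python
import zlib

def map_reading(record):
    """Move the timestamp from the value into a composite (sensor, timestamp) key."""
    t, sensor, reading = record
    return (sensor, t), reading

def partition(key, num_reducers):
    """Partition on the sensor id only, so one sensor's readings stay together."""
    sensor, _t = key
    return zlib.crc32(sensor.encode("utf-8")) % num_reducers

def reduce_stream(sorted_pairs):
    """Stream over sorted pairs, tracking only the current sensor and previous reading."""
    current_sensor, previous = None, None
    for (sensor, t), reading in sorted_pairs:
        if sensor != current_sensor:      # boundary between two sensors
            current_sensor, previous = sensor, None
        delta = None if previous is None else reading - previous
        yield sensor, t, reading, delta
        previous = reading

# Raw dump in arbitrary order: (timestamp, sensor id, reading).
records = [(2, "m2", 6.5), (1, "m1", 5.0), (3, "m1", 4.0), (1, "m2", 7.0), (2, "m1", 5.5)]
intermediate = sorted((map_reading(r) for r in records), key=lambda kv: kv[0])
series = list(reduce_stream(intermediate))
```

Because the composite keys arrive sorted by sensor and then by timestamp, the reducer can compute a quantity that depends on temporal order while holding a single previous reading in memory.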

The basic tradeoff between the two approaches discussed above (buffer and in-memory sort vs. value-to-key conversion) is where sorting is performed. One can explicitly implement secondary sorting in the reducer, which is likely to be faster but suffers from a scalability bottleneck.¹⁰ With value-to-key conversion, sorting is offloaded to the MapReduce execution framework. Note that this approach can be arbitrarily extended to tertiary, quaternary, etc. sorting. This pattern results in many more keys for the framework to sort, but distributed sorting is a task that the MapReduce runtime excels at since it lies at the heart of the programming model.

⁹ Alternatively, Hadoop provides API hooks to define “groups” of intermediate keys that should be processed together in the reducer.

¹⁰ Note that, in principle, this need not be an in-memory sort. It is entirely possible to implement a disk-based sort within the reducer, although one would be duplicating functionality that is already present in the MapReduce execution framework. It makes more sense to take advantage of functionality that is already present with value-to-key conversion.

3.5 RELATIONAL JOINS

One popular application of Hadoop is data-warehousing. In an enterprise setting, a data warehouse serves as a vast repository of data, holding everything from sales transactions to product inventories. Typically, the data is relational in nature, but increasingly data warehouses are used to store semi-structured data (e.g., query logs) as well as unstructured data. Data warehouses form a foundation for business intelligence applications designed to provide decision support. It is widely believed that insights gained by mining historical, current, and prospective data can yield competitive advantages in the marketplace.

Traditionally, data warehouses have been implemented through relational databases, particularly those optimized for a specific workload known as online analytical processing (OLAP). A number of vendors offer parallel databases, but customers find that they often cannot cost-effectively scale to the crushing amounts of data an organization needs to deal with today. Parallel databases are often quite expensive, on the order of tens of thousands of dollars per terabyte of user data. Over the past few years, Hadoop has gained popularity as a platform for data-warehousing. Hammerbacher [68], for example, discussed Facebook’s experiences with scaling up business intelligence applications with Oracle databases, which they ultimately abandoned in favor of a Hadoop-based solution developed in-house called Hive (which is now an open-source project). Pig [114] is a platform for massive data analytics built on Hadoop and capable of handling structured as well as semi-structured data. It was originally developed by Yahoo, but is now also an open-source project.

Given successful applications of Hadoop to data-warehousing and complex analytical queries

that are prevalent in such an environment, it makes sense to examine MapReduce algorithms for

manipulating relational data. This section focuses speciﬁcally on performing relational joins in Map-

Reduce. We should stress here that even though Hadoop has been applied to process relational data,

Hadoop is not a database. There is an ongoing debate between advocates of parallel databases and proponents of MapReduce regarding the merits of both approaches for OLAP-type workloads. DeWitt and Stonebraker, two well-known figures in the database community, famously decried MapReduce as “a major step backwards” in a controversial blog post.¹¹ With colleagues, they ran a series of

benchmarks that demonstrated the supposed superiority of column-oriented parallel databases over

Hadoop [120; 144]. However, see Dean and Ghemawat’s counterarguments [47] and recent attempts

at hybrid architectures [1].

We shall refrain here from participating in this lively debate, and instead focus on discussing

algorithms. From an application point of view, it is highly unlikely that an analyst interacting with

a data warehouse would ever be called upon to write MapReduce programs (and indeed, Hadoop-

based systems such as Hive and Pig present a much higher-level language for interacting with large

amounts of data). Nevertheless, it is instructive to understand the algorithms that underlie basic

relational operations.

This section presents three different strategies for performing relational joins on two datasets (relations), generically named S and T. Let us suppose that relation S looks something like the following:

$(k_1, s_1, S_1)$
$(k_2, s_2, S_2)$
$(k_3, s_3, S_3)$
. . .

where k is the key we would like to join on, $s_n$ is a unique id for the tuple, and the $S_n$ after $s_n$ denotes other attributes in the tuple (unimportant for the purposes of the join). Similarly, suppose relation T looks something like this:

$(k_1, t_1, T_1)$
$(k_3, t_2, T_2)$
$(k_8, t_3, T_3)$
. . .

where k is the join key, $t_n$ is a unique id for the tuple, and the $T_n$ after $t_n$ denotes other attributes in the tuple.

¹¹ http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/

To make this task more concrete, we present one realistic scenario: S might represent a col-

lection of user proﬁles, in which case k could be interpreted as the primary key (i.e., user id). The

tuples might contain demographic information such as age, gender, income, etc. The other dataset,

T , might represent logs of online activity. Each tuple might correspond to a page view of a particular

URL and may contain additional information such as time spent on the page, ad revenue generated,

etc. The k in these tuples could be interpreted as the foreign key that associates each individual page

view with a user. Joining these two datasets would allow an analyst, for example, to break down

online activity in terms of demographics.

3.5.1 REDUCE-SIDE JOIN

The ﬁrst approach to relational joins is what’s known as a reduce-side join. The idea is quite simple:

we map over both datasets and emit the join key as the intermediate key, and the tuple itself as

the intermediate value. Since MapReduce guarantees that all values with the same key are brought together, all tuples will be grouped by the join key, which is exactly what we need to perform the join operation. This approach is known as a parallel sort-merge join in the database community [134].

In more detail, there are three different cases to consider.

The ﬁrst and simplest is a one-to-one join, where at most one tuple from S and one tuple from

T share the same join key (but it may be the case that no tuple from S shares the join key with a

tuple from T , or vice versa). In this case, the algorithm sketched above will work ﬁne. The reducer

will be presented keys and lists of values along the lines of the following:

$k_{23} \rightarrow [(s_{64}, S_{64}), (t_{84}, T_{84})]$
$k_{37} \rightarrow [(s_{68}, S_{68})]$
$k_{59} \rightarrow [(t_{97}, T_{97}), (s_{81}, S_{81})]$
$k_{61} \rightarrow [(t_{99}, T_{99})]$
. . .

Since we’ve emitted the join key as the intermediate key, we can remove it from the value to save a bit of space.¹² If there are two values associated with a key, then we know that one must be from S and the other must be from T. However, recall that in the basic MapReduce programming model, no guarantees are made about value ordering, so the first value might be from S or from T. We can proceed to join the two tuples and perform additional computations (e.g., filter by some other attribute, compute aggregates, etc.). If there is only one value associated with a key, this means that no tuple in the other dataset shares the join key, so the reducer does nothing.

¹² Not very important if the intermediate data is compressed.
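A minimal Python sketch of the one-to-one reduce-side join follows. The text does not say how the reducer tells an S tuple from a T tuple; tagging each value with its source relation, as done here, is one common option and is an assumption of this sketch, as are the helper names and sample data.

```python
from collections import defaultdict

def map_tuple(source, record):
    """Emit the join key as the intermediate key; tag the value with its relation."""
    k, tuple_id, attrs = record
    return k, (source, tuple_id, attrs)

def join_reduce(key, values):
    """Join only when both relations contributed a tuple for this key."""
    if len(values) != 2:
        return []  # no partner in the other dataset, so do nothing
    first, second = values
    # Value order is arbitrary, so use the tag to decide which tuple is from S.
    s, t = (first, second) if first[0] == "S" else (second, first)
    return [(key, s[1:], t[1:])]

S = [("k23", "s64", "S64"), ("k37", "s68", "S68"), ("k59", "s81", "S81")]
T = [("k23", "t84", "T84"), ("k59", "t97", "T97"), ("k61", "t99", "T99")]

# Local stand-in for the shuffle: T is fed first to mimic arbitrary value order.
groups = defaultdict(list)
for source, relation in (("T", T), ("S", S)):
    for record in relation:
        key, value = map_tuple(source, record)
        groups[key].append(value)

joined = []
for key in sorted(groups):
    joined.extend(join_reduce(key, groups[key]))
```

Keys k37 and k61 appear in only one relation and therefore produce no output, matching the inner-join behavior described above.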

Let us now consider the one-to-many join. Assume that tuples in S have unique join keys (i.e.,

k is the primary key in S), so that S is the “one” and T is the “many”. The above algorithm will still

work, but when processing each key in the reducer, we have no idea when the value corresponding

to the tuple from S will be encountered, since values are arbitrarily ordered. The easiest solution is to

buffer all values in memory, pick out the tuple from S, and then cross it with every tuple from T to

perform the join. However, as we have seen several times already, this creates a scalability bottleneck

since we may not have sufﬁcient memory to hold all the tuples with the same join key.

This is a problem that requires a secondary sort, and the solution lies in the value-to-key

conversion design pattern we just presented. In the mapper, instead of simply emitting the join key

as the intermediate key, we instead create a composite key consisting of the join key and the tuple

id (from either S or T ). Two additional changes are required: First, we must deﬁne the sort order of

the keys to ﬁrst sort by the join key, and then sort all tuple ids from S before all tuple ids from T .

Second, we must deﬁne the partitioner to pay attention to only the join key, so that all composite

keys with the same join key arrive at the same reducer.

After applying the value-to-key conversion design pattern, the reducer will be presented with

keys and values along the lines of the following:

$(k_{82}, s_{105}) \rightarrow [(S_{105})]$
$(k_{82}, t_{98}) \rightarrow [(T_{98})]$
$(k_{82}, t_{101}) \rightarrow [(T_{101})]$
$(k_{82}, t_{137}) \rightarrow [(T_{137})]$
. . .

Since both the join key and the tuple id are present in the intermediate key, we can remove them

from the value to save a bit of space.¹³

Whenever the reducer encounters a new join key, it is

guaranteed that the associated value will be the relevant tuple from S. The reducer can hold this

tuple in memory and then proceed to cross it with tuples from T in subsequent steps (until a new

join key is encountered). Since the MapReduce execution framework performs the sorting, there is

no need to buffer tuples (other than the single one from S). Thus, we have eliminated the scalability

bottleneck.
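The one-to-many procedure can be sketched in Python as follows. This is a single-process simulation under stated assumptions: the composite key is encoded as (join key, source rank, tuple id) with rank 0 for S and 1 for T, tuple ids are integers, and the helper names are hypothetical.

```python
def composite_key(join_key, source, tuple_id):
    # Sort order: by join key first, then all S tuples (rank 0) before T tuples (rank 1).
    return (join_key, 0 if source == "S" else 1, tuple_id)

def partition(key, num_reducers):
    # The partitioner looks only at the join key, ignoring the rest of the composite key.
    return hash(key[0]) % num_reducers

def reduce_stream(sorted_pairs):
    """Reducer that keeps only the current tuple from S in memory."""
    current_key, current_s = None, None
    for (join_key, source_rank, _tuple_id), attrs in sorted_pairs:
        if join_key != current_key:
            current_key, current_s = join_key, None  # new join key: reset state
        if source_rank == 0:
            current_s = attrs                        # the single tuple from S
        elif current_s is not None:
            yield (join_key, current_s, attrs)       # cross it with each tuple from T

# Hypothetical toy data: (join key, tuple id, attributes).
S = [(82, 105, "S105"), (90, 110, "S110")]
T = [(82, 137, "T137"), (82, 98, "T98"), (90, 120, "T120"), (82, 101, "T101"), (95, 130, "T130")]

NUM_REDUCERS = 2
intermediate = [(composite_key(k, "S", i), a) for k, i, a in S]
intermediate += [(composite_key(k, "T", i), a) for k, i, a in T]

partitions = {r: [] for r in range(NUM_REDUCERS)}
for key, value in intermediate:
    partitions[partition(key, NUM_REDUCERS)].append((key, value))

results = []
for r in range(NUM_REDUCERS):
    # sorted() stands in for the framework's sort of intermediate keys.
    results.extend(reduce_stream(sorted(partitions[r])))
print(results)
```

The reducer never buffers tuples from T; each one is joined and emitted as soon as it is read.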

Finally, let us consider the many-to-many join case. Assuming that S is the smaller dataset,

the above algorithm works as well. Consider what happens at the reducer:

$(k_{82}, s_{105}) \rightarrow [(S_{105})]$
$(k_{82}, s_{124}) \rightarrow [(S_{124})]$
. . .
$(k_{82}, t_{98}) \rightarrow [(T_{98})]$
$(k_{82}, t_{101}) \rightarrow [(T_{101})]$
$(k_{82}, t_{137}) \rightarrow [(T_{137})]$
. . .

¹³ Once again, not very important if the intermediate data is compressed.


All the tuples from S with the same join key will be encountered ﬁrst, which the reducer can buffer

in memory. As the reducer processes each tuple from T , it is crossed with all the tuples from S. Of

course, we are assuming that the tuples from S (with the same join key) will ﬁt into memory, which

is a limitation of this algorithm (and why we want to control the sort order so that the smaller dataset

comes ﬁrst).
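A sketch of the many-to-many reducer, again as plain Python rather than real MapReduce code. It assumes the intermediate pairs have already been sorted by a composite key of the hypothetical form (join key, source rank, tuple id), with rank 0 for S and 1 for T.

```python
def reduce_many_to_many(sorted_pairs):
    """Buffer the S tuples for the current join key, then cross each T tuple with them."""
    current_key, s_buffer = None, []
    for (join_key, source_rank, _tuple_id), attrs in sorted_pairs:
        if join_key != current_key:
            current_key, s_buffer = join_key, []  # new join key: discard the old buffer
        if source_rank == 0:
            s_buffer.append(attrs)                # S tuples arrive first and are buffered
        else:
            for s_attrs in s_buffer:              # each T tuple is streamed, not buffered
                yield (join_key, s_attrs, attrs)

pairs = sorted([
    ((82, 1, 98), "T98"),
    ((82, 0, 124), "S124"),
    ((82, 1, 101), "T101"),
    ((82, 0, 105), "S105"),
])
crossed = list(reduce_many_to_many(pairs))
print(crossed)
```

Memory use is bounded by the number of S tuples sharing one join key, which is why the smaller dataset should sort first.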

The basic idea behind the reduce-side join is to repartition the two datasets by the join key.

The approach isn’t particularly efﬁcient since it requires shufﬂing both datasets across the network.

This leads us to the map-side join.

3.5.2 MAP-SIDE JOIN

Suppose we have two datasets that are both sorted by the join key. We can perform a join by scanning

through both datasets simultaneously—this is known as a merge join in the database community. We

can parallelize this by partitioning and sorting both datasets in the same way. For example, suppose

S and T were both divided into ten ﬁles, partitioned in the same manner by the join key. Further

suppose that in each ﬁle, the tuples were sorted by the join key. In this case, we simply need to merge

join the first file of S with the first file of T, the second file of S with the second file of T, etc.

This can be accomplished in parallel, in the map phase of a MapReduce job—hence, a map-side

join. In practice, we map over one of the datasets (the larger one) and inside the mapper read the

corresponding part of the other dataset to perform the merge join.¹⁴ No reducer is required, unless

the programmer wishes to repartition the output or perform further processing.
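To make the idea concrete, here is a small Python sketch of a partitioned merge join. It assumes integer join keys, a modulo partitioning function, and unique keys in S; in-memory lists stand in for the sorted files, and all names are illustrative.

```python
def partition_and_sort(tuples, n):
    """Split (key, attrs) tuples into n partitions by key, each sorted by key."""
    parts = [[] for _ in range(n)]
    for key, attrs in tuples:
        parts[key % n].append((key, attrs))
    return [sorted(p) for p in parts]

def merge_join(s_sorted, t_sorted):
    """Scan two key-sorted lists in step; keys in s_sorted are assumed unique."""
    i = 0
    for t_key, t_attrs in t_sorted:
        while i < len(s_sorted) and s_sorted[i][0] < t_key:
            i += 1  # advance S until it catches up with the current T key
        if i < len(s_sorted) and s_sorted[i][0] == t_key:
            yield (t_key, s_sorted[i][1], t_attrs)

S = [(1, "s1"), (2, "s2"), (3, "s3"), (4, "s4")]
T = [(4, "t4a"), (1, "t1"), (4, "t4b"), (5, "t5")]

s_files = partition_and_sort(S, 2)
t_files = partition_and_sort(T, 2)

joined = []
for s_file, t_file in zip(s_files, t_files):  # each pair could be handled by one mapper
    joined.extend(merge_join(s_file, t_file))
print(joined)
```

Because both datasets use the same partitioning function, matching keys always land in files with the same index, so each pair of files can be joined independently.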

A map-side join is far more efﬁcient than a reduce-side join since there is no need to shufﬂe

the datasets over the network. But is it realistic to expect that the stringent conditions required for

map-side joins are satisﬁed? In many cases, yes. The reason is that relational joins happen within

the broader context of a workﬂow, which may include multiple steps. Therefore, the datasets that

are to be joined may be the output of previous processes (either MapReduce jobs or other code).

If the workﬂow is known in advance and relatively static (both reasonable assumptions in a mature

workﬂow), we can engineer the previous processes to generate output sorted and partitioned in a

way that makes efﬁcient map-side joins possible (in MapReduce, by using a custom partitioner

and controlling the sort order of key-value pairs). For ad hoc data analysis, reduce-side joins are a

more general, albeit less efﬁcient, solution. Consider the case where datasets have multiple keys that

one might wish to join on—then no matter how the data is organized, map-side joins will require

repartitioning of the data. Alternatively, it is always possible to repartition a dataset using an identity

mapper and reducer. But of course, this incurs the cost of shufﬂing data over the network.

There is a ﬁnal restriction to bear in mind when using map-side joins with the Hadoop

implementation of MapReduce. We assume here that the datasets to be joined were produced by

previous MapReduce jobs, so this restriction applies to keys the reducers in those jobs may emit.

Hadoop permits reducers to emit keys that are different from the input key whose values they are

processing (that is, input and output keys need not be the same, nor even the same type).¹⁵ However,

if the output key of a reducer is different from the input key, then the output dataset from the reducer

will not necessarily be partitioned in a manner consistent with the specified partitioner (because the

partitioner applies to the input keys rather than the output keys). Since map-side joins depend on

consistent partitioning and sorting of keys, the reducers used to generate data that will participate

in a later map-side join must not emit any key but the one they are currently processing.

¹⁴ Note that this almost always implies a non-local read.

3.5.3 MEMORY-BACKED JOIN

In addition to the two previous approaches to joining relational data that leverage the MapReduce

framework to bring together tuples that share a common join key, there is a family of approaches we

call memory-backed joins based on random access probes. The simplest version is applicable when

one of the two datasets completely ﬁts in memory on each node. In this situation, we can load the

smaller dataset into memory in every mapper, populating an associative array to facilitate random

access to tuples based on the join key. The mapper initialization API hook (see Section 3.1.1) can

be used for this purpose. Mappers are then applied to the other (larger) dataset, and for each input

key-value pair, the mapper probes the in-memory dataset to see if there is a tuple with the same

join key. If there is, the join is performed. This is known as a simple hash join by the database

community [51].
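A minimal sketch of this simple hash join in Python. The class and method names are hypothetical; the constructor plays the role of the mapper initialization hook, and a dict serves as the associative array.

```python
class HashJoinMapper:
    def __init__(self, small_dataset):
        # Initialization hook: load the smaller dataset into an associative array
        # keyed by join key (a list per key allows duplicate keys).
        self.lookup = {}
        for key, attrs in small_dataset:
            self.lookup.setdefault(key, []).append(attrs)

    def map(self, key, attrs):
        # Probe the in-memory dataset for tuples with the same join key.
        for s_attrs in self.lookup.get(key, []):
            yield (key, s_attrs, attrs)

S = [(1, "s1"), (2, "s2")]                               # smaller dataset, fits in memory
T = [(2, "t2a"), (3, "t3"), (2, "t2b"), (1, "t1")]       # larger dataset, streamed

mapper = HashJoinMapper(S)
joined = [row for key, attrs in T for row in mapper.map(key, attrs)]
print(joined)
```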

What if neither dataset ﬁts in memory? The simplest solution is to divide the smaller dataset,

let's say S, into n partitions, such that $S = S_1 \cup S_2 \cup \ldots \cup S_n$. We can choose n so that each partition

is small enough to ﬁt in memory, and then run n memory-backed hash joins. This, of course, requires

streaming through the other dataset n times.
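The partitioned variant can be sketched as follows, assuming a hash-based split of S (one of several reasonable ways to form the partitions) and hypothetical function names. Each iteration of the loop corresponds to one memory-backed hash join and one full pass over T.

```python
def hash_join(small, large):
    """In-memory hash join: build a lookup on `small`, then probe it with `large`."""
    lookup = {}
    for key, attrs in small:
        lookup.setdefault(key, []).append(attrs)
    return [(key, s, t) for key, t in large for s in lookup.get(key, [])]

def partitioned_hash_join(s, t, n):
    """Split S into n partitions and run one hash join per partition."""
    results = []
    for i in range(n):
        s_i = [(k, a) for k, a in s if hash(k) % n == i]  # partition S_i
        results.extend(hash_join(s_i, t))                 # streams through all of T again
    return results

S = [(1, "s1"), (2, "s2"), (3, "s3")]
T = [(3, "t3"), (1, "t1"), (4, "t4"), (1, "t1b")]

partitioned = partitioned_hash_join(S, T, 2)
print(partitioned)
```

Since the partitions of S are disjoint, the union of the n partial results equals the result of a single hash join over all of S.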

There is an alternative approach to memory-backed joins for cases where neither dataset

fits into memory. A distributed key-value store can be used to hold one dataset in memory across

multiple machines while mapping over the other. The mappers would then query this distributed

key-value store in parallel and perform joins if the join keys match.¹⁶ The open-source caching

system memcached can be used for exactly this purpose, and therefore we’ve dubbed this approach

memcached join. This approach is described in detail in a technical report [95].

¹⁵ In contrast, recall from Section 2.2 that in Google's implementation, reducers' output keys must be exactly the same as their input keys.

¹⁶ In order to achieve good performance in accessing distributed key-value stores, it is often necessary to batch queries before making synchronous requests (to amortize latency over many requests) or to rely on asynchronous requests.

3.6 SUMMARY

This chapter provides a guide on the design of MapReduce algorithms. In particular, we present a

number of “design patterns” that capture effective solutions to common problems. In summary, they

are:

• “In-mapper combining”, where the functionality of the combiner is moved into the mapper.

Instead of emitting intermediate output for every input key-value pair, the mapper aggregates

partial results across multiple input records and only emits intermediate key-value pairs after

some amount of local aggregation is performed.

• The related patterns “pairs” and “stripes” for keeping track of joint events from a large number

of observations. In the pairs approach, we keep track of each joint event separately, whereas in

the stripes approach we keep track of all events that co-occur with the same event. Although

the stripes approach is signiﬁcantly more efﬁcient, it requires memory on the order of the size

of the event space, which presents a scalability bottleneck.

• “Order inversion”, where the main idea is to convert the sequencing of computations into

a sorting problem. Through careful orchestration, we can send the reducer the result of a

computation (e.g., an aggregate statistic) before it encounters the data necessary to produce

that computation.

• “Value-to-key conversion”, which provides a scalable solution for secondary sorting. By moving

part of the value into the key, we can exploit the MapReduce execution framework itself for

sorting.

Ultimately, controlling synchronization in the MapReduce programming model boils down to effective

use of the following techniques:

1. Constructing complex keys and values that bring together data necessary for a computation.

This is used in all of the above design patterns.

2. Executing user-speciﬁed initialization and termination code in either the mapper or reducer.

For example, in-mapper combining depends on emission of intermediate key-value pairs in

the map task termination code.

3. Preserving state across multiple inputs in the mapper and reducer. This is used in in-mapper

combining, order inversion, and value-to-key conversion.

4. Controlling the sort order of intermediate keys. This is used in order inversion and value-to-key

conversion.

5. Controlling the partitioning of the intermediate key space. This is used in order inversion and

value-to-key conversion.

This concludes our overview of MapReduce algorithm design. It should be clear by now that although

the programming model forces one to express algorithms in terms of a small set of rigidly deﬁned

components, there are many tools at one’s disposal to shape the ﬂow of computation. In the next

few chapters, we will focus on speciﬁc classes of MapReduce algorithms: for inverted indexing in

Chapter 4, for graph processing in Chapter 5, and for expectation-maximization in Chapter 6.

CHAPTER 4

Inverted Indexing for Text Retrieval

Web search is the quintessential large-data problem. Given an information need expressed as a short

query consisting of a few terms, the system’s task is to retrieve relevant web objects (web pages,

PDF documents, PowerPoint slides, etc.) and present them to the user. How large is the web? It is

difﬁcult to compute exactly, but even a conservative estimate would place the size at several tens of

billions of pages, totaling hundreds of terabytes (considering text alone). In real-world applications,

users demand results quickly from a search engine—query latencies longer than a few hundred

milliseconds will try a user’s patience. Fulﬁlling these requirements is quite an engineering feat,

considering the amounts of data involved!

Nearly all retrieval engines for full-text search today rely on a data structure called an inverted

index, which given a term provides access to the list of documents that contain the term. In information

retrieval parlance, objects to be retrieved are generically called “documents” even though in

actuality they may be web pages, PDFs, or even fragments of code. Given a user query, the retrieval

engine uses the inverted index to score documents that contain the query terms with respect to

some ranking model, taking into account features such as term matches, term proximity, attributes

of the terms in the document (e.g., bold, appears in title, etc.), as well as the hyperlink structure of

the documents (e.g., PageRank [117], which we’ll discuss in Chapter 5, or related metrics such as

HITS [84] and SALSA [88]).

The web search problem decomposes into three components: gathering web content (crawling),

construction of the inverted index (indexing), and ranking documents given a query (retrieval).

Crawling and indexing share similar characteristics and requirements, but these are very different

from retrieval. Gathering web content and building inverted indexes are for the most part ofﬂine

problems. Both need to be scalable and efficient, but they do not need to operate in real time. Indexing

is usually a batch process that runs periodically: the frequency of refreshes and updates is

usually dependent on the design of the crawler. Some sites (e.g., news organizations) update their

content quite frequently and need to be visited often; other sites (e.g., government regulations) are

relatively static. However, even for rapidly changing sites, it is usually tolerable to have a delay of

a few minutes until content is searchable. Furthermore, since the amount of content that changes

rapidly is relatively small, running smaller-scale index updates at greater frequencies is usually an

adequate solution.¹ Retrieval, on the other hand, is an online problem that demands sub-second

response time. Individual users expect low query latencies, but query throughput is equally important

since a retrieval engine must usually serve many users concurrently. Furthermore, query loads

are highly variable, depending on the time of day, and can exhibit “spikey” behavior due to special

circumstances (e.g., a breaking news event triggers a large number of searches on the same topic).

On the other hand, resource consumption for the indexing problem is more predictable.

¹ Leaving aside the problem of searching live data streams such as tweets, which requires different techniques and algorithms.

A comprehensive treatment of web search is beyond the scope of this chapter, and even this

entire book. Explicitly recognizing this, we mostly focus on the problem of inverted indexing, the

task most amenable to solutions in MapReduce. This chapter begins by ﬁrst providing an overview

of web crawling (Section 4.1) and introducing the basic structure of an inverted index (Section 4.2).

A baseline inverted indexing algorithm in MapReduce is presented in Section 4.3. We point out a

scalability bottleneck in that algorithm, which leads to a revised version presented in Section 4.4.

Index compression is discussed in Section 4.5, which ﬁlls in missing details on building compact

index structures. Since MapReduce is primarily designed for batch-oriented processing, it does not

provide an adequate solution for the retrieval problem, an issue we discuss in Section 4.6. The chapter

concludes with a summary and pointers to additional readings.

4.1 WEB CRAWLING

Before building inverted indexes, we must ﬁrst acquire the document collection over which these

indexes are to be built. In academia and for research purposes, this can be relatively straightforward.

Standard collections for information retrieval research are widely available for a variety of genres

ranging from blogs to newswire text. For researchers who wish to explore web-scale retrieval, there

is the ClueWeb09 collection that contains one billion web pages in ten languages (totaling 25 terabytes)

crawled by Carnegie Mellon University in early 2009.² Obtaining access to these standard

collections is usually as simple as signing an appropriate data license from the distributor of the

collection, paying a reasonable fee, and arranging for receipt of the data.³

For real-world web search, however, one cannot simply assume that the collection is already

available. Acquiring web content requires crawling, which is the process of traversing the web by

repeatedly following hyperlinks and storing downloaded pages for subsequent processing. Conceptually,

the process is quite simple to understand: we start by populating a queue with a “seed” list

of pages. The crawler downloads pages in the queue, extracts links from those pages to add to the

queue, stores the pages for further processing, and repeats. In fact, rudimentary web crawlers can be

written in a few hundred lines of code.
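The conceptual loop just described can be sketched in a few lines of Python. To keep the example self-contained and runnable without network access, a hypothetical in-memory dictionary (TOY_WEB) stands in for the web, and the fetch and extract_links functions are placeholders for real HTTP download and HTML parsing.

```python
from collections import deque

def crawl(seeds, fetch, extract_links, max_pages=100):
    """Breadth-first crawl: download queued pages, enqueue unseen links, store pages."""
    queue = deque(seeds)
    seen = set(seeds)   # avoid queueing the same page twice
    store = {}          # downloaded pages, kept for subsequent processing
    while queue and len(store) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue    # download failed or page does not exist
        store[url] = page
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store

# Toy stand-in for the web: page name -> page content listing its outgoing links.
TOY_WEB = {
    "a": "links: b c",
    "b": "links: c d e",
    "c": "links: a",
    "d": "links:",
}

def fetch(url):
    return TOY_WEB.get(url)      # None for the dangling link "e"

def extract_links(page):
    return page.split()[1:]      # everything after the "links:" marker

pages = crawl(["a"], fetch, extract_links)
print(list(pages))
```

None of the complications of a production crawler (politeness delays, prioritization, distribution, duplicate detection) appear in this sketch.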

However, effective and efficient web crawling is far more complex. The following lists a number

of issues that real-world crawlers must contend with:

² http://boston.lti.cs.cmu.edu/Data/clueweb09/

³ As an interesting side note, in the 1990s, research collections were distributed via postal mail on CD-ROMs, and later, on DVDs. Electronic distribution became common earlier this decade for collections below a certain size. However, many collections today are so large that the only practical method of distribution is shipping hard drives via postal mail.


• A web crawler must practice good “etiquette” and not overload web servers. For example, it is

common practice to wait a ﬁxed amount of time before repeated requests to the same server. In

order to respect these constraints while maintaining good throughput, a crawler typically keeps

many execution threads running in parallel and maintains many TCP connections (perhaps

hundreds) open at the same time.

• Since a crawler has ﬁnite bandwidth and resources, it must prioritize the order in which

unvisited pages are downloaded. Such decisions must be made online and in an adversarial

environment, in the sense that spammers actively create “link farms” and “spider traps” full of

spam pages to trick a crawler into overrepresenting content from a particular site.

• Most real-world web crawlers are distributed systems that run on clusters of machines, often

geographically distributed. To avoid downloading a page multiple times and to ensure data

consistency, the crawler as a whole needs mechanisms for coordination and load-balancing. It

also needs to be robust with respect to machine failures, network outages, and errors of various

types.

• Web content changes, but with different frequency depending on both the site and the nature

of the content. A web crawler needs to learn these update patterns to ensure that content is

reasonably current. Getting the right recrawl frequency is tricky: too frequent means wasted

resources, but not frequent enough leads to stale content.

• The web is full of duplicate content. Examples include multiple copies of a popular conference

paper, mirrors of frequently accessed sites such as Wikipedia, and newswire content that is

often duplicated. The problem is compounded by the fact that most repetitious pages are not

exact duplicates but near duplicates (that is, basically the same page but with different ads,

navigation bars, etc.) It is desirable during the crawling process to identify near duplicates and

select the best exemplar to index.

• The web is multilingual. There is no guarantee that pages in one language only link to pages

in the same language. For example, a professor in Asia may maintain her website in the local

language, but link to her publications in English. Furthermore, many pages contain a mix of text

in different languages. Since document processing techniques (e.g., tokenization, stemming)

differ by language, it is important to identify the (dominant) language on a page.

The above discussion is not meant to be an exhaustive enumeration of issues, but rather to give the

reader an appreciation of the complexities involved in this intuitively simple task. For more information,

see a recent survey on web crawling [113]. Section 4.7 provides suggestions for additional

readings.

Figure 4.1: Simple illustration of an inverted index. Each term is associated with a list of postings. Each

posting is comprised of a document id and a payload, denoted by p in this case. An inverted index provides

quick access to document ids that contain a term.

4.2 INVERTED INDEXES

In its basic form, an inverted index consists of postings lists, one associated with each term that

appears in the collection.⁴ The structure of an inverted index is illustrated in Figure 4.1. A postings

list is comprised of individual postings, each of which consists of a document id and a payload—

information about occurrences of the term in the document. The simplest payload is…nothing!

For simple boolean retrieval, no additional information is needed in the posting other than the

document id; the existence of the posting itself indicates the presence of the term in the document.

The most common payload, however, is term frequency (tf), or the number of times the term occurs

in the document. More complex payloads include positions of every occurrence of the term in the

document (to support phrase queries and document scoring based on term proximity), properties

of the term (such as if it occurred in the page title or not, to support document ranking based on

notions of importance), or even the results of additional linguistic processing (for example, indicating

that the term is part of a place name, to support address searches). In the web context, anchor text

information (text associated with hyperlinks from other pages to the page in question) is useful in

enriching the representation of document content (e.g., [107]); this information is often stored in

the index as well.
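As a small illustration of this structure, the Python sketch below builds an in-memory inverted index over a three-document toy collection. The representation is an assumption made for the example: a dict maps each term to its postings list, and each posting is a tuple (document id, term frequency, positions), so the payload carries both tf and positional information.

```python
def build_index(docs):
    """Map each term to a postings list of (docid, tf, positions), sorted by docid."""
    index = {}
    for docid in sorted(docs):
        positions = {}
        for pos, term in enumerate(docs[docid].split()):
            positions.setdefault(term, []).append(pos)
        for term, pos_list in positions.items():
            # Payload: term frequency plus the position of every occurrence.
            index.setdefault(term, []).append((docid, len(pos_list), pos_list))
    return index

docs = {1: "one fish two fish", 2: "red fish blue fish", 3: "one red bird"}
index = build_index(docs)
print(index["fish"])
print(index["red"])
```

Dropping the last two fields of each posting leaves the empty payload that suffices for simple boolean retrieval.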

In the example shown in Figure 4.1, we see that $\text{term}_1$ occurs in $\{d_1, d_5, d_6, d_{11}, \ldots\}$, $\text{term}_2$

occurs in $\{d_{11}, d_{23}, d_{59}, d_{84}, \ldots\}$, and $\text{term}_3$ occurs in $\{d_1, d_4, d_{11}, d_{19}, \ldots\}$. In an actual implementation,

we assume that documents can be identified by a unique integer ranging from 1 to n, where

n is the total number of documents.⁵ Generally, postings are sorted by document id, although other

sort orders are possible as well. The document ids have no inherent semantic meaning, although

assignment of numeric ids to documents need not be arbitrary. For example, pages from the same

domain may be consecutively numbered. Or, alternatively, pages that are higher in quality (based,

for example, on PageRank values) might be assigned smaller numeric values so that they appear

toward the front of a postings list. Either way, an auxiliary data structure is necessary to maintain

the mapping from integer document ids to some other more meaningful handle, such as a URL.

⁴ In information retrieval parlance, term is preferred over word since documents are processed (e.g., tokenization and stemming) into basic units that are often not words in the linguistic sense.

⁵ It is preferable to start numbering the documents at one since it is not possible to code zero with many common compression schemes used in information retrieval; see Section 4.5.

Given a query, retrieval involves fetching postings lists associated with query terms and traversing

the postings to compute the result set. In the simplest case, boolean retrieval involves set operations

(union for boolean OR and intersection for boolean AND) on postings lists, which can

be accomplished very efﬁciently since the postings are sorted by document id. In the general case,

however, query–document scores must be computed. Partial document scores are stored in structures

called accumulators. At the end (i.e., once all postings have been processed), the top k documents are

then extracted to yield a ranked list of results for the user. Of course, there are many optimization

strategies for query evaluation (both approximate and exact) that reduce the number of postings a

retrieval engine must examine.
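The efficiency of boolean retrieval on sorted postings comes from a linear merge, sketched below in Python. Postings are reduced to bare document ids (the empty payload), and the example lists reuse the document ids shown in Figure 4.1; the function names are our own.

```python
def intersect(p1, p2):
    """Boolean AND: linear-time merge of two docid lists sorted in increasing order."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    """Boolean OR: linear-time merge that keeps each docid once."""
    i = j = 0
    out = []
    while i < len(p1) or j < len(p2):
        if j == len(p2) or (i < len(p1) and p1[i] < p2[j]):
            out.append(p1[i])
            i += 1
        elif i == len(p1) or p2[j] < p1[i]:
            out.append(p2[j])
            j += 1
        else:  # same docid in both lists
            out.append(p1[i])
            i += 1
            j += 1
    return out

term1 = [1, 5, 6, 11]
term2 = [11, 23, 59, 84]
term3 = [1, 4, 11, 19]

print(intersect(term1, term3))
print(union(term1, term2))
```

Each list is traversed once, so the cost is proportional to the combined length of the two postings lists.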

The size of an inverted index varies, depending on the payload stored in each posting. If

only term frequency is stored, a well-optimized inverted index can be a tenth of the size of the

original document collection. An inverted index that stores positional information would easily be

several times larger than one that does not. Generally, it is possible to hold the entire vocabulary

(i.e., dictionary of all the terms) in memory, especially with techniques such as front-coding [156].

However, with the exception of well-resourced, commercial web search engines,⁶ postings lists

are usually too large to store in memory and must be held on disk, usually in compressed form

(more details in Section 4.5). Query evaluation, therefore, necessarily involves random disk access

and “decoding” of the postings. One important aspect of the retrieval problem is to organize disk

operations such that random seeks are minimized.

Once again, this brief discussion glosses over many complexities and does a huge injustice

to the tremendous amount of research in information retrieval. However, our goal is to provide

the reader with an overview of the important issues; Section 4.7 provides references to additional

readings.

4.3 INVERTED INDEXING: BASELINE IMPLEMENTATION

MapReduce was designed from the very beginning to produce the various data structures involved

in web search, including inverted indexes and the web graph. We begin with the basic inverted

indexing algorithm shown in Figure 4.2.

Input to the mapper consists of document ids (keys) paired with the actual content (values).

Individual documents are processed in parallel by the mappers. First, each document is analyzed and

broken down into its component terms. The processing pipeline differs depending on the application

and type of document, but for web pages typically involves stripping out HTML tags and other

elements such as JavaScript code, tokenizing, case folding, removing stopwords (common words

such as ‘the’, ‘a’, ‘of ’, etc.), and stemming (removing afﬁxes from words so that ‘dogs’ becomes ‘dog’).

⁶ Google keeps indexes in memory.


1: class Mapper
2:   procedure Map(docid n, doc d)
3:     H ← new AssociativeArray
4:     for all term t ∈ doc d do
5:       H{t} ← H{t} + 1
6:     for all term t ∈ H do
7:       Emit(term t, posting ⟨n, H{t}⟩)

1: class Reducer
2:   procedure Reduce(term t, postings [⟨n_1, f_1⟩, ⟨n_2, f_2⟩ . . .])
3:     P ← new List
4:     for all posting ⟨a, f⟩ ∈ postings [⟨n_1, f_1⟩, ⟨n_2, f_2⟩ . . .] do
5:       P.Add(⟨a, f⟩)
6:     P.Sort()
7:     Emit(term t, postings P)

Figure 4.2: Pseudo-code of the baseline inverted indexing algorithm in MapReduce. Mappers emit

postings keyed by terms, the execution framework groups postings by term, and the reducers write

postings lists to disk.

Once the document has been analyzed, term frequencies are computed by iterating over all the terms

and keeping track of counts. Lines 4 and 5 in the pseudo-code reflect the process of computing term

frequencies, but hide the details of document processing. After this histogram has been built, the

mapper then iterates over all terms. For each term, a pair consisting of the document id and the term

frequency is created. Each pair, denoted by ⟨n, H{t}⟩ in the pseudo-code, represents an individual

posting. The mapper then emits an intermediate key-value pair with the term as the key and the

posting as the value, in line 7 of the mapper pseudo-code. Although as presented here only the

term frequency is stored in the posting, this algorithm can be easily augmented to store additional

information (e.g., term positions) in the payload.

In the shufﬂe and sort phase, the MapReduce runtime essentially performs a large, distributed

group by of the postings by term. Without any additional effort by the programmer, the execution

framework brings together all the postings that belong in the same postings list. This tremendously

simpliﬁes the task of the reducer, which simply needs to gather together all the postings and write

them to disk. The reducer begins by initializing an empty list and then appends all postings associated

with the same key (term) to the list. The postings are then sorted by document id, and the entire

postings list is emitted as a value, with the term as the key. Typically, the postings list is ﬁrst

compressed, but we leave this aside for now (see Section 4.5 for more details). The final key-value

pairs are written to disk and comprise the inverted index. Since each reducer writes its output in

a separate ﬁle in the distributed ﬁle system, our ﬁnal index will be split across r ﬁles, where r is

the number of reducers. There is no need to further consolidate these files. Separately, we must also

build an index to the postings lists themselves for the retrieval engine: this is typically in the form

of mappings from term to (file, byte offset) pairs, so that given a term, the retrieval engine can fetch

its postings list by opening the appropriate file and seeking to the correct byte offset position in that

file.

[Figure 4.3: diagram omitted. Input documents: doc 1 “one fish, two fish”; doc 2 “red fish, blue fish”; doc 3 “one red bird”.]

Figure 4.3: Simple illustration of the baseline inverted indexing algorithm in MapReduce with three

mappers and two reducers. Postings are shown as pairs of boxes (docid, tf).

Execution of the complete algorithm is illustrated in Figure 4.3 with a toy example consisting of

three documents, three mappers, and two reducers. Intermediate key-value pairs (from the mappers)

and the ﬁnal key-value pairs comprising the inverted index (from the reducers) are shown in the

boxes with dotted lines. Postings are shown as pairs of boxes, with the document id on the left and

the term frequency on the right.
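The toy example can also be traced with a short single-process Python simulation of the baseline algorithm. This is a sketch rather than Hadoop code: a dictionary of lists stands in for the shuffle and sort phase, document processing is reduced to lowercasing and splitting on whitespace, and the function names are our own.

```python
from collections import Counter, defaultdict

def map_doc(docid, text):
    """Mapper: build a term histogram, then emit (term, (docid, tf)) per distinct term."""
    terms = text.lower().replace(",", " ").split()  # minimal document processing
    for term, tf in Counter(terms).items():
        yield term, (docid, tf)

def reduce_term(term, postings):
    """Reducer: buffer every posting for the term, sort by docid, emit the postings list."""
    return term, sorted(postings)

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "one red bird"}

# Stand-in for shuffle and sort: group postings by term.
grouped = defaultdict(list)
for docid, text in docs.items():
    for term, posting in map_doc(docid, text):
        grouped[term].append(posting)

index = dict(reduce_term(term, postings) for term, postings in grouped.items())
for term in sorted(index):
    print(term, index[term])
```

Note that reduce_term holds the full postings list for a term in memory before sorting it, exactly as the pseudo-code reducer does.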

The MapReduce programming model provides a very concise expression of the inverted

indexing algorithm. Its implementation is similarly concise: the basic algorithm can be implemented

in as few as a couple dozen lines of code in Hadoop (with minimal document processing). Such an

implementation can be completed as a week-long programming assignment in a course for advanced

undergraduates or ﬁrst-year graduate students [83; 93]. In a non-MapReduce indexer, a signiﬁcant


fraction of the code is devoted to grouping postings by term, given constraints imposed by memory

and disk (e.g., memory capacity is limited, disk seeks are slow, etc.). In MapReduce, the programmer

does not need to worry about any of these issues—most of the heavy lifting is performed by the

execution framework.

4.4 INVERTED INDEXING: REVISED IMPLEMENTATION

The inverted indexing algorithm presented in the previous section serves as a reasonable baseline.

However, there is a signiﬁcant scalability bottleneck: the algorithm assumes that there is sufﬁcient

memory to hold all postings associated with the same term. Since the basic MapReduce execution

framework makes no guarantees about the ordering of values associated with the same key, the

reducer ﬁrst buffers all postings (line 5 of the reducer pseudo-code in Figure 4.2) and then performs

an in-memory sort before writing the postings to disk.⁷ For efficient retrieval, postings need to be

sorted by document id. However, as collections become larger, postings lists grow longer, and at

some point in time, reducers will run out of memory.

There is a simple solution to this problem. Since the execution framework guarantees that

keys arrive at each reducer in sorted order, one way to overcome the scalability bottleneck is to let

the MapReduce runtime do the sorting for us. Instead of emitting key-value pairs of the following

type:

(term t, posting ⟨docid, f⟩)

We emit intermediate key-value pairs of this type instead:

(tuple ⟨t, docid⟩, tf f)

In other words, the key is a tuple containing the term and the document id, while the value is the

term frequency. This is exactly the value-to-key conversion design pattern introduced in Section 3.4.

With this modiﬁcation, the programming model ensures that the postings arrive in the correct order.

This, combined with the fact that reducers can hold state across multiple keys, allows postings lists

to be created with minimal memory usage. As a detail, remember that we must deﬁne a custom

partitioner to ensure that all tuples with the same term are shufﬂed to the same reducer.

The revised MapReduce inverted indexing algorithm is shown in Figure 4.4. The mapper

remains unchanged for the most part, other than differences in the intermediate key-value pairs.

The Reduce method is called for each key (i.e., ⟨t, n⟩), and by design, there will only be one

value associated with each key. For each key-value pair, a posting can be directly added to the

postings list. Since the postings are guaranteed to arrive in sorted order by document id, they can be

incrementally coded in compressed form—thus ensuring a small memory footprint. Finally, when

all postings associated with the same term have been processed (i.e., $t \neq t_{prev}$), the entire postings

list is emitted. The ﬁnal postings list must be written out in the Close method. As with the baseline

⁷ See similar discussion in Section 3.4: in principle, this need not be an in-memory sort. It is entirely possible to implement a disk-based sort within the reducer.

1: class Mapper
2:     method Map(docid n, doc d)
3:         H ← new AssociativeArray
4:         for all term t ∈ doc d do
5:             H{t} ← H{t} + 1
6:         for all term t ∈ H do
7:             Emit(tuple ⟨t, n⟩, tf H{t})

1: class Reducer
2:     method Initialize
3:         t_prev ← ∅
4:         P ← new PostingsList
5:     method Reduce(tuple ⟨t, n⟩, tf [f])
6:         if t ≠ t_prev ∧ t_prev ≠ ∅ then
7:             Emit(term t_prev, postings P)
8:             P.Reset()
9:         P.Add(⟨n, f⟩)
10:        t_prev ← t
11:    method Close
12:        Emit(term t_prev, postings P)

Figure 4.4: Pseudo-code of a scalable inverted indexing algorithm in MapReduce. By applying the

value-to-key conversion design pattern, the execution framework is exploited to sort postings so that they

arrive sorted by document id in the reducer.

algorithm, payloads can be easily changed: by simply replacing the intermediate value f (term

frequency) with whatever else is desired (e.g., term positional information).
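As a rough illustration of the revised algorithm, the following single-process Python sketch imitates the mapper, a term-only partitioner, the framework's sort, and a reducer that holds state across keys. It is not Hadoop code: `build_index`, `partition`, and the use of `sorted` in place of the shuffle-and-sort phase are stand-ins chosen for this sketch, and the three-document collection in the usage note is made up.

```python
import zlib
from collections import Counter

def map_doc(docid, text):
    """Emit ((term, docid), tf) pairs: the value-to-key conversion."""
    for term, tf in Counter(text.split()).items():
        yield (term, docid), tf

def partition(key, num_reducers):
    """Custom partitioner: route on the term only, ignoring the docid."""
    term, _docid = key
    return zlib.crc32(term.encode("utf-8")) % num_reducers

class Reducer:
    def __init__(self):
        self.prev_term = None      # state preserved across Reduce calls
        self.postings = []
        self.output = []

    def reduce(self, key, values):
        term, docid = key
        if term != self.prev_term and self.prev_term is not None:
            self.output.append((self.prev_term, self.postings))
            self.postings = []
        self.postings.append((docid, values[0]))   # exactly one value per key
        self.prev_term = term

    def close(self):
        if self.prev_term is not None:             # flush the final postings list
            self.output.append((self.prev_term, self.postings))

def build_index(docs, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for docid, text in docs.items():
        for key, tf in map_doc(docid, text):
            buckets[partition(key, num_reducers)].append((key, tf))
    index = {}
    for bucket in buckets:
        reducer = Reducer()
        for key, tf in sorted(bucket):   # stands in for the framework's key sort
            reducer.reduce(key, [tf])
        reducer.close()
        index.update(reducer.output)
    return index
```

Calling `build_index({1: "one fish two fish", 2: "red fish blue fish", 3: "one red bird"})` yields postings already ordered by document id, with the reducer never holding more than one postings list at a time (and a real implementation would compress it incrementally rather than keep a Python list).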

There is one more detail we must address when building inverted indexes. Since almost all

retrieval models take into account document length when computing query–document scores, this

information must also be extracted. Although it is straightforward to express this computation as

another MapReduce job, this task can actually be folded into the inverted indexing process. When

processing the terms in each document, the document length is known, and can be written out as

“side data” directly to HDFS. We can take advantage of the ability for a mapper to hold state across

the processing of multiple documents in the following manner: an in-memory associative array

is created to store document lengths, which is populated as each document is processed.⁸ When

the mapper ﬁnishes processing input records, document lengths are written out to HDFS (i.e., in

the Close method). This approach is essentially a variant of the in-mapper combining pattern.

Document length data end up in m different ﬁles, where m is the number of mappers; these ﬁles are

⁸ In general, there is no worry about insufficient memory to hold these data.

then consolidated into a more compact representation. Alternatively, document length information

can be emitted in special key-value pairs by the mapper. One must then write a custom partitioner

so that these special key-value pairs are shufﬂed to a single reducer, which will be responsible for

writing out the length data separate from the postings lists.
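The first approach (document lengths written as side data when the mapper closes) might look roughly like the sketch below. The class name `IndexingMapper`, the `side_store` dictionary standing in for HDFS, and the `doclen-NNNNN` file naming are assumptions made purely for illustration.

```python
from collections import Counter

class IndexingMapper:
    def __init__(self, mapper_id, side_store):
        self.mapper_id = mapper_id
        self.side_store = side_store   # stand-in for HDFS (an assumption)
        self.doc_lengths = {}          # state held across input documents

    def map(self, docid, text):
        terms = text.split()
        self.doc_lengths[docid] = len(terms)       # length is known at this point
        for term, tf in Counter(terms).items():
            yield (term, docid), tf                # regular postings output

    def close(self):
        # One side file per mapper; the m files are consolidated afterwards.
        name = f"doclen-{self.mapper_id:05d}"
        self.side_store[name] = dict(self.doc_lengths)
```

Because `map` is a generator here, the lengths are recorded as the caller consumes its output; in a real framework the mapper would emit through a context object instead.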

4.5 INDEX COMPRESSION

We return to the question of how postings are actually compressed and stored on disk. This chapter devotes a substantial amount of space to this topic because index compression is one of the

main differences between a “toy” indexer and one that works on real-world collections. Otherwise,

MapReduce inverted indexing algorithms are pretty straightforward.

Let us consider the canonical case where each posting consists of a document id and the term

frequency. A naïve implementation might represent the first as a 32-bit integer⁹ and the second as a 16-bit integer. Thus, a postings list might be encoded as follows:

[(5, 2), (7, 3), (12, 1), (49, 1), (51, 2), . . .]

where each posting is represented by a pair in parentheses. Note that all brackets, parentheses, and

commas are only included to enhance readability; in reality, the postings would be represented as a

long stream of integers. This naïve implementation would require six bytes per posting. Using this

scheme, the entire inverted index would be about as large as the collection itself. Fortunately, we can

do signiﬁcantly better.

The ﬁrst trick is to encode differences between document ids as opposed to the document ids

themselves. Since the postings are sorted by document id, the differences (called d-gaps) are always positive integers. The above postings list, represented with d-gaps, would be:

[(5, 2), (2, 3), (5, 1), (37, 1), (2, 2), . . .]

Of course, we must actually encode the ﬁrst document id. We haven’t lost any information, since

the original document ids can be easily reconstructed from the d-gaps. However, it’s not obvious

that we’ve reduced the space requirements either, since the largest possible d-gap is one less than the

number of documents in the collection.
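The d-gap transformation and its inverse can be written in a few lines. The helper names `to_dgaps` and `from_dgaps` are hypothetical; the sketch assumes document ids are positive and strictly increasing, as in the example above.

```python
from itertools import accumulate

def to_dgaps(docids):
    """Keep the first docid as-is; replace every later one by its gap."""
    return [cur - prev for prev, cur in zip([0] + docids[:-1], docids)]

def from_dgaps(gaps):
    """A running sum restores the original document ids."""
    return list(accumulate(gaps))
```

For the postings list above, `to_dgaps([5, 7, 12, 49, 51])` gives `[5, 2, 5, 37, 2]`, and `from_dgaps` maps that back to the original ids.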

This is where the second trick comes in, which is to represent the d-gaps in a way such that

it takes less space for smaller numbers. Similarly, we want to apply the same techniques to compress

the term frequencies, since for the most part they are also small values. But to understand how this is

done, we need to take a slight detour into compression techniques, particularly for coding integers.

Compression, in general, can be characterized as either lossless or lossy: it’s fairly obvious that

lossless compression is required in this context. To start, it is important to understand that all

compression techniques represent a time–space tradeoff. That is, we reduce the amount of space

on disk necessary to store data, but at the cost of extra processor cycles that must be spent coding

⁹ However, note that $2^{32}-1$ is “only” 4,294,967,295, which is much less than even the most conservative estimate of the size of the web.

and decoding data. Therefore, it is possible that compression reduces size but also slows processing.

However, if the two factors are properly balanced (i.e., decoding speed can keep up with disk

bandwidth), we can achieve the best of both worlds: smaller and faster.

4.5.1 BYTE-ALIGNED AND WORD-ALIGNED CODES

In most programming languages, an integer is encoded in four bytes and holds a value between 0 and $2^{32}-1$, inclusive. We limit our discussion to unsigned integers, since d-gaps are always positive

(and greater than zero). This means that 1 and 4,294,967,295 both occupy four bytes. Obviously,

encoding d-gaps this way doesn’t yield any reductions in size.

A simple approach to compression is to only use as many bytes as is necessary to represent the

integer. This is known as variable-length integer coding (varInt for short) and accomplished by using

the high order bit of every byte as the continuation bit, which is set to one in the last byte and zero

elsewhere. As a result, we have 7 bits per byte for coding the value, which means that $0 \le n < 2^{7}$ can be expressed with 1 byte, $2^{7} \le n < 2^{14}$ with 2 bytes, $2^{14} \le n < 2^{21}$ with 3 bytes, and $2^{21} \le n < 2^{28}$ with 4 bytes. This scheme can be extended to code arbitrarily large integers (i.e., beyond 4 bytes).

As a concrete example, the two numbers:

127, 128

would be coded as such:

1 1111111, 0 0000001 1 0000000

The above code contains two code words, the ﬁrst consisting of 1 byte, and the second consisting of 2

bytes. Of course, the comma and the spaces are there only for readability. Variable-length integers are

byte-aligned because the code words always fall along byte boundaries. As a result, there is never any

ambiguity about where one code word ends and the next begins. However, the downside of varInt

coding is that decoding involves lots of bit operations (masks, shifts). Furthermore, the continuation

bit sometimes results in frequent branch mispredicts (depending on the actual distribution of d-gaps),

which slows down processing.
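A minimal sketch of the varInt scheme as described here, with the high-order bit set to one in the last byte of each code word and zero elsewhere, and the most significant 7-bit group written first (as in the 128 example). The function names are hypothetical, and note that other varInt formats in common use invert the meaning of the continuation bit or the byte order.

```python
def varint_encode(n):
    """Encode a non-negative integer using 7 payload bits per byte."""
    if n < 0:
        raise ValueError("varInt as described here covers unsigned integers only")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()        # most significant 7-bit group first
    groups[-1] |= 0x80      # continuation bit set to one marks the last byte
    return bytes(groups)

def varint_decode_stream(data):
    """Decode a concatenation of code words; byte alignment makes this unambiguous."""
    values, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:     # last byte of the current code word
            values.append(n)
            n = 0
    return values
```

With these definitions, 127 encodes to the single byte `0xFF` and 128 to the two bytes `0x01 0x80`, matching the bit patterns shown above. The mask, shift, and per-byte branch in the decoder are the operations the text identifies as the cost of this scheme.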

A variant of the varInt scheme was described by Jeff Dean in a keynote talk at the WSDM 2009 conference.¹⁰ The insight is to code groups of four integers at a time. Each group begins with

a preﬁx byte, divided into four 2-bit values that specify the byte length of each of the following

integers. For example, the following preﬁx byte:

00,00,01,10

indicates that the following four integers are one byte, one byte, two bytes, and three bytes, respectively. Therefore, each group of four integers would consume anywhere between 5 and 17 bytes. A

simple lookup table based on the preﬁx byte directs the decoder on how to process subsequent bytes

to recover the coded integers. The advantage of this group varInt coding scheme is that values can

¹⁰ http://research.google.com/people/jeff/WSDM09-keynote.pdf

be decoded with fewer branch mispredicts and bitwise operations. Experiments reported by Dean

suggest that decoding integers with this scheme is more than twice as fast as the basic varInt scheme.
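The group varInt layout can be illustrated with the sketch below. The text fixes only the prefix byte (four 2-bit fields holding byte length minus one); the little-endian byte order inside each integer, the function names, and the loop-based decoder (a production decoder would use a 256-entry lookup table keyed on the prefix byte) are assumptions of this sketch.

```python
def group_varint_encode(four):
    """Encode exactly four integers, each below 2**32, as prefix byte + payload."""
    if len(four) != 4:
        raise ValueError("group varInt codes four integers at a time")
    prefix, body = 0, bytearray()
    for n in four:
        length = max(1, (n.bit_length() + 7) // 8)   # bytes needed, at least one
        if n < 0 or length > 4:
            raise ValueError("each value must fit in four bytes")
        prefix = (prefix << 2) | (length - 1)        # 2-bit field per integer
        body += n.to_bytes(length, "little")         # byte order is an assumption
    return bytes([prefix]) + bytes(body)

def group_varint_decode(data, pos=0):
    """Decode one group starting at pos; return (values, position after group)."""
    prefix = data[pos]
    pos += 1
    values = []
    for shift in (6, 4, 2, 0):
        length = ((prefix >> shift) & 0b11) + 1
        values.append(int.from_bytes(data[pos:pos + length], "little"))
        pos += length
    return values, pos
```

Encoding `[1, 15, 511, 131071]` produces the prefix byte `00,00,01,10` from the example above, followed by 1 + 1 + 2 + 3 payload bytes.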

In most architectures, accessing an entire machine word is more efficient than fetching all its

bytes separately. Therefore, it makes sense to store postings in increments of 16-bit, 32-bit, or 64-bit

machine words. Anh and Moffat [8] presented several word-aligned coding methods, one of which

is called Simple-9, based on 32-bit words. In this coding scheme, four bits in each 32-bit word are

reserved as a selector. The remaining 28 bits are used to code actual integer values. Now, there are a

variety of ways these 28 bits can be divided to code one or more integers: 28 bits can be used to code

one 28-bit integer, two 14-bit integers, three 9-bit integers (with one bit unused), etc., all the way

up to twenty-eight 1-bit integers. In fact, there are nine different ways the 28 bits can be divided

into equal parts (hence the name of the technique), some with leftover unused bits. The chosen division is recorded in

the selector bits. Therefore, decoding involves reading a 32-bit word, examining the selector to see

how the remaining 28 bits are packed, and then appropriately decoding each integer. Coding works

in the opposite way: the algorithm scans ahead to see how many integers can be squeezed into 28

bits, packs those integers, and sets the selector bits appropriately.
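The following sketch shows one way a Simple-9 style coder could work. The nine (count, width) layouts follow from dividing 28 bits into equal parts; placing the selector in the top four bits, numbering the selectors 0 through 8, packing values from the low-order end, and the greedy choice of the densest layout that fits are assumptions of this sketch rather than details given in the text.

```python
# (number of integers, bits per integer) for the nine ways to split 28 bits.
LAYOUTS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def simple9_encode(values):
    """Greedily pack non-negative integers (each below 2**28) into 32-bit words."""
    words, i = [], 0
    while i < len(values):
        for selector, (count, bits) in enumerate(LAYOUTS):
            chunk = values[i:i + count]
            if len(chunk) == count and all(0 <= v < (1 << bits) for v in chunk):
                word = selector << 28                 # selector in the top 4 bits
                for k, v in enumerate(chunk):
                    word |= v << (k * bits)
                words.append(word)
                i += count
                break
        else:
            raise ValueError("value does not fit in 28 bits")
    return words

def simple9_decode(words):
    """Read each word's selector, then unpack the remaining 28 bits accordingly."""
    out = []
    for word in words:
        count, bits = LAYOUTS[word >> 28]
        mask = (1 << bits) - 1
        out.extend((word >> (k * bits)) & mask for k in range(count))
    return out
```

The encoder only uses a layout when enough integers remain to fill it completely, so the decoder never has to guess how many values in the final word are padding.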

4.5.2 BIT-ALIGNED CODES

The advantage of byte-aligned and word-aligned codes is that they can be coded and decoded quickly.

The downside, however, is that they must consume multiples of eight bits, even when fewer bits might

sufﬁce (the Simple-9 scheme gets around this by packing multiple integers into a 32-bit word, but

even then, bits are often wasted). In bit-aligned codes, on the other hand, code words can occupy

any number of bits, meaning that boundaries can fall anywhere. In practice, coding and decoding

bit-aligned codes require processing bytes and appropriately shifting or masking bits (usually more

involved than varInt and group varInt coding).

One additional challenge with bit-aligned codes is that we need a mechanism to delimit code

words, i.e., tell where the last ends and the next begins, since there are no byte boundaries to guide

us. To address this issue, most bit-aligned codes are so-called preﬁx codes (confusingly, they are also

called preﬁx-free codes), in which no valid code word is a preﬁx of any other valid code word. For

example, coding 0 ≤ x < 3 with {0, 1, 01} is not a valid preﬁx code, since 0 is a preﬁx of 01, and

so we can’t tell if 01 is two code words or one. On the other hand, {00, 01, 1} is a valid preﬁx code,

such that a sequence of bits:

0001101001010100

can be unambiguously segmented into:

00 01 1 01 00 1 01 01 00

and decoded without any additional delimiters.
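The prefix property is what allows a decoder to cut a bit stream into code words by reading left to right and stopping at the first match. A small sketch, with bit streams represented as Python strings of "0" and "1" for readability and with hypothetical function names:

```python
def is_prefix_free(codebook):
    """True if no code word is a proper prefix of another code word."""
    return not any(a != b and b.startswith(a) for a in codebook for b in codebook)

def segment(bits, codebook):
    """Split a bit string into code words, assuming the codebook is prefix-free."""
    words, current = [], ""
    for bit in bits:
        current += bit
        if current in codebook:      # first match is the only possible match
            words.append(current)
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code word")
    return words
```

`is_prefix_free({"0", "1", "01"})` is false because 0 is a prefix of 01, whereas `segment("0001101001010100", {"00", "01", "1"})` recovers the segmentation shown above.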

One of the simplest preﬁx codes is the unary code. An integer x > 0 is coded as x −1 one

bits followed by a zero bit. Note that unary codes do not allow the representation of zero, which

                                   Golomb
 x    unary        γ           b = 5     b = 10
 1    0            0           0:00      0:000
 2    10           10:0        0:01      0:001
 3    110          10:1        0:10      0:010
 4    1110         110:00      0:110     0:011
 5    11110        110:01      0:111     0:100
 6    111110       110:10      10:00     0:101
 7    1111110      110:11      10:01     0:1100
 8    11111110     1110:000    10:10     0:1101
 9    111111110    1110:001    10:110    0:1110
 10   1111111110   1110:010    10:111    0:1111

Figure 4.5: The first ten positive integers in unary, γ, and Golomb (b = 5, 10) codes.

is fine since d-gaps and term frequencies should never be zero.¹¹ As an example, 4 in unary code

is 1110. With unary code we can code x in x bits, which although economical for small values,

becomes inefﬁcient for even moderately large values. Unary codes are rarely used by themselves, but

form a component of other coding schemes. Unary codes of the ﬁrst ten positive integers are shown

in Figure 4.5.

Elias γ code is an efficient coding scheme that is widely used in practice. An integer x > 0 is broken into two components, $1 + \lfloor \log_2 x \rfloor$ (= n, the length), which is coded in unary code, and $x - 2^{\lfloor \log_2 x \rfloor}$ (= r, the remainder), which is in binary.¹² The unary component n specifies the number of bits required to code x, and the binary component codes the remainder r in n − 1 bits. As an example, consider x = 10: $1 + \lfloor \log_2 10 \rfloor = 4$, which is 1110. The binary component codes $x - 2^{3} = 2$ in 4 − 1 = 3 bits, which is 010. Putting both together, we arrive at 1110:010. The extra colon is inserted only for readability; it’s not part of the final code, of course.

Working in reverse, it is easy to unambiguously decode a bit stream of γ codes: First, we read a unary code $c_u$, which is a prefix code. This tells us that the binary portion is written in $c_u - 1$ bits, which we then read as $c_b$. We can then reconstruct x as $2^{c_u - 1} + c_b$. For x < 16, γ codes occupy

less than a full byte, which makes them more compact than variable-length integer codes. Since

term frequencies for the most part are relatively small, γ codes make sense for them and can yield

substantial space savings. For reference, the γ codes of the ﬁrst ten positive integers are shown in

Figure 4.5. A variation on γ code is δ code, where the n portion of the γ code is coded in γ code

itself (as opposed to unary code). For smaller values γ codes are more compact, but for larger values,

δ codes take less space.
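Unary and γ coding, as defined above, can be sketched as follows. Bit streams are modeled as Python strings of "0" and "1" purely for readability (a real coder would pack bits into bytes), and the function names are hypothetical.

```python
def unary(x):
    """x - 1 one bits followed by a zero bit, for x > 0."""
    return "1" * (x - 1) + "0"

def gamma_encode(x):
    """Unary code of the length n, then the remainder r in n - 1 binary bits."""
    if x <= 0:
        raise ValueError("gamma codes are defined for x > 0")
    n = x.bit_length()                    # equals 1 + floor(log2 x)
    r = x - (1 << (n - 1))                # x - 2**floor(log2 x)
    return unary(n) + (format(r, "b").zfill(n - 1) if n > 1 else "")

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into integers."""
    out, pos = [], 0
    while pos < len(bits):
        c_u = bits.index("0", pos) - pos + 1          # value of the unary part
        pos += c_u
        c_b = int(bits[pos:pos + c_u - 1], 2) if c_u > 1 else 0
        pos += c_u - 1
        out.append((1 << (c_u - 1)) + c_b)            # 2**(c_u - 1) + c_b
    return out
```

For instance, `gamma_encode(10)` returns `"1110010"`, which is the 1110:010 of the worked example without the colon.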

¹¹ As a note, some sources describe slightly different formulations of the same coding scheme. Here, we adopt the conventions used in the classic IR text Managing Gigabytes [156].

¹² Note that $\lfloor x \rfloor$ is the floor function, which maps x to the largest integer not greater than x, so, e.g., $\lfloor 3.8 \rfloor = 3$. For non-negative values, this is the default behavior in many programming languages when casting from a floating-point type to an integer type.

Unary and γ codes are parameterless, but even better compression can be achieved with

parameterized codes. A good example of this is Golomb code. For some parameter b, an integer

x > 0 is coded in two parts: first, we compute $q = \lfloor (x - 1)/b \rfloor$ and code q + 1 in unary; then, we code the remainder $r = x - qb - 1$ in truncated binary. This is accomplished as follows: if b is a power of two, then truncated binary is exactly the same as normal binary, requiring $\log_2 b$ bits. Otherwise, we code the first $2^{\lfloor \log_2 b \rfloor + 1} - b$ values of r in $\lfloor \log_2 b \rfloor$ bits and code the rest of the values of r by coding $r + 2^{\lfloor \log_2 b \rfloor + 1} - b$ in ordinary binary representation using $\lfloor \log_2 b \rfloor + 1$ bits. In this case, r is coded in either $\lfloor \log_2 b \rfloor$ or $\lfloor \log_2 b \rfloor + 1$ bits, and unlike ordinary binary coding, truncated binary codes are prefix codes. As an example, if b = 5, then r can take the values

{0, 1, 2, 3, 4}, which would be coded with the following code words: {00, 01, 10, 110, 111}. For

reference, Golomb codes of the ﬁrst ten positive integers are shown in Figure 4.5 for b = 5 and

b = 10. A special case of Golomb code is worth noting: if b is a power of two, then coding and

decoding can be handled more efﬁciently (needing only bit shifts and bit masks, as opposed to

multiplication and division). These are known as Rice codes.
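A sketch of Golomb coding with truncated binary remainders, following the definitions above. As in the other sketches, bits are modeled as strings of "0" and "1", and the function names are hypothetical.

```python
def truncated_binary(r, b):
    """Code r in [0, b) using floor(log2 b) or floor(log2 b) + 1 bits."""
    k = b.bit_length() - 1                # floor(log2 b)
    if b == 1 << k:                       # power of two: ordinary k-bit binary
        return format(r, "b").zfill(k) if k else ""
    cutoff = (1 << (k + 1)) - b           # number of short code words
    if r < cutoff:
        return format(r, "b").zfill(k)
    return format(r + cutoff, "b").zfill(k + 1)

def golomb_encode(x, b):
    """q + 1 in unary (q ones, then a zero), then r in truncated binary."""
    q, r = divmod(x - 1, b)               # r equals x - q*b - 1
    return "1" * q + "0" + truncated_binary(r, b)

def golomb_decode(bits, b):
    """Decode a concatenation of Golomb codes that share the parameter b."""
    k = b.bit_length() - 1
    cutoff = (1 << (k + 1)) - b
    out, pos = [], 0
    while pos < len(bits):
        q = bits.index("0", pos) - pos    # count of leading one bits
        pos += q + 1
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        if b != 1 << k and r >= cutoff:   # long code word: read one more bit
            r = ((r << 1) | int(bits[pos])) - cutoff
            pos += 1
        out.append(q * b + r + 1)
    return out
```

When b is a power of two (the Rice code case), the `divmod` reduces to a shift and a mask, which is the efficiency gain mentioned above; this sketch does not special-case it.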

Researchers have shown that Golomb compression works well for d-gaps, and is optimal with

the following parameter setting:

$$b \approx 0.69 \times \frac{N}{\mathit{df}} \qquad (4.1)$$

where df is the document frequency of the term, and N is the number of documents in the collection.¹³ Putting everything together, one popular approach for postings compression is to represent

d-gaps with Golomb codes and term frequencies with γ codes [156; 162]. If positional information

is desired, we can use the same trick to code differences between term positions using γ codes.

4.5.3 POSTINGS COMPRESSION

Having completed our slight detour into integer compression techniques, we can now return to

the scalable inverted indexing algorithm shown in Figure 4.4 and discuss how postings lists can be

properly compressed. As we can see from the previous section, there is a wide range of choices that

represent different tradeoffs between compression ratio and decoding speed. Actual performance also

depends on characteristics of the collection, which, among other factors, determine the distribution

of d-gaps. Büttcher et al. [30] recently compared the performance of various compression techniques

on coding document ids. In terms of the amount of compression that can be obtained (measured in

bits per docid), Golomb and Rice codes performed the best, followed by γ codes, Simple-9, varInt,

and group varInt (the least space efﬁcient). In terms of raw decoding speed, the order was almost

¹³ For details as to why this is the case, we refer the reader elsewhere [156], but here’s the intuition: under reasonable assumptions, the appearance of postings can be modeled as a sequence of independent Bernoulli trials, which implies a certain distribution of d-gaps. From this we can derive an optimal setting of b.

the reverse: group varInt was the fastest, followed by varInt.¹⁴ Simple-9 was substantially slower,

and the bit-aligned codes were even slower than that. Within the bit-aligned codes, Rice codes were

the fastest, followed by γ , with Golomb codes being the slowest (about ten times slower than group

varInt).

Let us discuss what modiﬁcations are necessary to our inverted indexing algorithm if we were

to adopt Golomb compression for d-gaps and represent term frequencies with γ codes. Note that

this represents a space-efﬁcient encoding, at the cost of slower decoding compared to alternatives.

Whether or not this is actually a worthwhile tradeoff in practice is not important here: use of Golomb

codes serves a pedagogical purpose, to illustrate how one might set compression parameters.

Coding term frequencies with γ codes is easy since they are parameterless. Compressing d-gaps with Golomb codes, however, is a bit tricky, since two parameters are required: the size of the

document collection and the number of postings for a particular postings list (i.e., the document

frequency, or df). The ﬁrst is easy to obtain and can be passed into the reducer as a constant. The

df of a term, however, is not known until all the postings have been processed—and unfortunately,

the parameter must be known before any posting is coded. At first glance, this seems like a chicken-and-egg problem. A two-pass solution that involves first buffering the postings (in memory) would

suffer from the memory bottleneck we’ve been trying to avoid in the ﬁrst place.

To get around this problem, we need to somehow inform the reducer of a term’s df before

any of its postings arrive. This can be solved with the order inversion design pattern introduced in

Section 3.3 to compute relative frequencies. The solution is to have the mapper emit special keys of

the form ⟨t, ∗⟩ to communicate partial document frequencies. That is, inside the mapper, in addition

to emitting intermediate key-value pairs of the following form:

(tuple ⟨t, docid⟩, tf f)

we also emit special intermediate key-value pairs like this:

(tuple ⟨t, ∗⟩, df e)

to keep track of document frequencies associated with each term. In practice, we can accomplish

this by applying the in-mapper combining design pattern (see Section 3.1). The mapper holds an

in-memory associative array that keeps track of how many documents a term has been observed in

(i.e., the local document frequency of the term for the subset of documents processed by the mapper). Once the mapper has processed all input records, special keys of the form ⟨t, ∗⟩ are emitted with the

partial df as the value.

To ensure that these special keys arrive ﬁrst, we deﬁne the sort order of the tuple so that the

special symbol ∗ precedes all documents (part of the order inversion design pattern). Thus, for each

term, the reducer will first encounter the ⟨t, ∗⟩ key, associated with a list of values representing partial

df values originating from each mapper. Summing all these partial contributions will yield the term’s

¹⁴ However, this study found less speed difference between group varInt and basic varInt than Dean’s analysis, presumably due to the different distribution of d-gaps in the collections they were examining.

df, which can then be used to set the Golomb compression parameter b. This allows the postings

to be incrementally compressed as they are encountered in the reducer—memory bottlenecks are

eliminated since we do not need to buffer postings in memory.

Once again, the order inversion design pattern comes to the rescue. Recall that the pattern is

useful when a reducer needs to access the result of a computation (e.g., an aggregate statistic) before

it encounters the data necessary to produce that computation. For computing relative frequencies,

that bit of information was the marginal. In this case, it’s the document frequency.
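To make the order inversion concrete, here is a single-process sketch of a reducer that receives the special ⟨t, ∗⟩ key first, derives b from Equation 4.1, and then compresses postings as they arrive. Several things are assumptions of the sketch: the sentinel `STAR = -1` (valid only if docids are positive), rounding b to the nearest integer with a minimum of 1, bit strings modeled as Python strings, and `sorted` plus `groupby` standing in for the framework's shuffle and sort.

```python
from itertools import groupby

STAR = -1   # assumption: docids are positive, so -1 sorts ahead of all of them

def gamma(x):
    n = x.bit_length()
    tail = format(x - (1 << (n - 1)), "b").zfill(n - 1) if n > 1 else ""
    return "1" * (n - 1) + "0" + tail

def golomb(x, b):
    q, r = divmod(x - 1, b)
    k = b.bit_length() - 1
    cutoff = (1 << (k + 1)) - b
    if k == 0:                            # b = 1: no remainder bits at all
        tail = ""
    elif r < cutoff:
        tail = format(r, "b").zfill(k)
    else:
        tail = format(r + cutoff, "b").zfill(k + 1)
    return "1" * q + "0" + tail

class CompressingReducer:
    def __init__(self, num_docs):
        self.num_docs = num_docs          # N, passed in as a constant
        self.term, self.b, self.last_docid = None, None, 0
        self.bits, self.output = [], {}

    def reduce(self, key, values):
        term, docid = key
        if docid == STAR:                 # special key: partial dfs from mappers
            self.flush()
            self.term, self.last_docid = term, 0
            self.b = max(1, round(0.69 * self.num_docs / sum(values)))
            return
        (tf,) = values                    # one value per (term, docid) key
        self.bits.append(golomb(docid - self.last_docid, self.b) + gamma(tf))
        self.last_docid = docid

    def flush(self):
        if self.term is not None:
            self.output[self.term] = (self.b, "".join(self.bits))
            self.bits = []

    close = flush                         # emit the final postings list

def run_reducer(pairs, num_docs):
    reducer = CompressingReducer(num_docs)
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        reducer.reduce(key, [value for _, value in group])
    reducer.close()
    return reducer.output
```

Note that b is stored alongside each compressed postings list here, since a decoder needs the same parameter; how that is recorded in a real index is outside what the text specifies.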

4.6 WHAT ABOUT RETRIEVAL?

Thus far, we have brieﬂy discussed web crawling and focused mostly on MapReduce algorithms for

inverted indexing. What about retrieval? It should be fairly obvious that MapReduce, which was

designed for large batch operations, is a poor solution for retrieval. Since users demand sub-second

response times, every aspect of retrieval must be optimized for low latency, which is exactly the

opposite tradeoff made in MapReduce. Recall the basic retrieval problem: we must look up postings

lists corresponding to query terms, systematically traverse those postings lists to compute query–

document scores, and then return the top k results to the user. Looking up postings implies random

disk seeks, since for the most part postings are too large to ﬁt into memory (leaving aside caching

and other special cases for now). Unfortunately, random access is not a forte of the distributed ﬁle

system underlying MapReduce—such operations require multiple round-trip network exchanges

(and associated latencies). In HDFS, a client must ﬁrst obtain the location of the desired data block

from the namenode before the appropriate datanode can be contacted for the actual data. Of course,

access will typically require a random disk seek on the datanode itself.

It should be fairly obvious that serving the search needs of a large number of users, each of

whom demands sub-second response times, is beyond the capabilities of any single machine. The

only solution is to distribute retrieval across a large number of machines, which necessitates breaking

up the index in some manner. There are two main partitioning strategies for distributed retrieval:

document partitioning and term partitioning. Under document partitioning, the entire collection is

broken up into multiple smaller sub-collections, each of which is assigned to a server. In other

words, each server holds the complete index for a subset of the entire collection. This corresponds

to partitioning vertically in Figure 4.6. With term partitioning, on the other hand, each server is

responsible for a subset of the terms for the entire collection. That is, a server holds the postings for

all documents in the collection for a subset of terms. This corresponds to partitioning horizontally

in Figure 4.6.

Document and term partitioning require different retrieval strategies and represent different

tradeoffs. Retrieval under document partitioning involves a query broker, which forwards the user’s query to all partition servers, merges partial results from each, and then returns the final results to the

user. With this architecture, searching the entire collection requires that the query be processed by

every partition server. However, since each partition operates independently and traverses postings

Figure 4.6: Term–document matrix for a toy collection (nine documents, nine terms) illustrating different partitioning strategies: partitioning vertically (1, 2, 3) corresponds to document partitioning, whereas partitioning horizontally (a, b, c) corresponds to term partitioning.

in parallel, document partitioning typically yields shorter query latencies (compared to a single

monolithic index with much longer postings lists).
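The broker's scatter-gather role under document partitioning can be sketched as below. Everything here is a toy stand-in: each partition is a dictionary from term to postings, scoring simply sums term frequencies, and the partitions are queried in a loop rather than in parallel over a network.

```python
import heapq

def search_partition(index, query_terms, k):
    """One partition server: traverse local postings and return its top k hits."""
    scores = {}
    for term in query_terms:
        for docid, tf in index.get(term, []):
            scores[docid] = scores.get(docid, 0) + tf   # toy scoring (assumption)
    return heapq.nlargest(k, ((score, docid) for docid, score in scores.items()))

def broker(partitions, query_terms, k):
    """Forward the query to every partition, then merge the partial results."""
    partial = [search_partition(p, query_terms, k) for p in partitions]
    return heapq.nlargest(k, (hit for hits in partial for hit in hits))
```

Each partition only needs to return its own top k, since no document outside a partition's top k can appear in the global top k. The same structure shows why an offline partition degrades results gracefully: its hits are simply absent from the merge.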

Retrieval under term partitioning, on the other hand, requires a very different strategy. Suppose the user’s query Q contains three terms, $q_1$, $q_2$, and $q_3$. Under the pipelined query evaluation strategy, the broker begins by forwarding the query to the server that holds the postings for $q_1$ (usually the least frequent term). The server traverses the appropriate postings list and computes partial query–document scores, stored in the accumulators. The accumulators are then passed to the server that holds the postings associated with $q_2$ for additional processing, and then to the server for $q_3$, before

ﬁnal results are passed back to the broker and returned to the user. Although this query evaluation

strategy may not substantially reduce the latency of any particular query, it can theoretically increase

a system’s throughput due to the far smaller number of total disk seeks required for each user

query (compared to document partitioning). However, load-balancing is tricky in a pipelined term-

partitioned architecture due to skew in the distribution of query terms, which can create “hot spots”

on servers that hold the postings for frequently occurring query terms.
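For contrast, pipelined evaluation under term partitioning might be sketched as follows. The `servers` and `locate` dictionaries, the additive toy scoring, and the in-process hand-off of the accumulators are all assumptions of this sketch; in a real system the accumulators travel over the network from one server to the next.

```python
def process_on_server(postings, accumulators):
    """One term server: fold its postings list into the running accumulators."""
    for docid, tf in postings:
        accumulators[docid] = accumulators.get(docid, 0) + tf   # toy scoring
    return accumulators

def pipelined_query(servers, locate, query_terms, k):
    """Visit the server for each query term in turn, passing accumulators along."""
    accumulators = {}
    for term in query_terms:             # caller orders terms, rarest first
        server = servers[locate[term]]   # locate: term -> server id (hypothetical)
        accumulators = process_on_server(server.get(term, []), accumulators)
    ranked = sorted(accumulators.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]
```

Only the servers that own a query term are touched, which is the source of the throughput advantage, but a server holding a very common term is visited by a disproportionate share of queries.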


In general, studies have shown that document partitioning is a better strategy overall [109],

and this is the strategy adopted by Google [16]. Furthermore, it is known that Google maintains its

indexes in memory (although this is certainly not the common case for search engines in general).

One key advantage of document partitioning is that result quality degrades gracefully with machine

failures. Partition servers that are ofﬂine will simply fail to deliver results for their subsets of the

collection. With sufﬁcient partitions, users might not even be aware that documents are missing.

For most queries, the web contains more relevant documents than any user has time to digest:

users of course care about getting relevant documents (sometimes, they are happy with a single

relevant document), but they are generally less discriminating when it comes to which relevant

documents appear in their results (out of the set of all relevant documents). Note that partitions may

be unavailable due to reasons other than machine failure: cycling through different partitions is a

very simple and non-disruptive strategy for index updates.

Working in a document-partitioned architecture, there are a variety of approaches to dividing

up the web into smaller pieces. Proper partitioning of the collection can address one major weakness

of this architecture, which is that every partition server is involved in every user query. Along one

dimension, it is desirable to partition by document quality using one or more classiﬁers; see [124] for

a recent survey on web page classiﬁcation. Partitioning by document quality supports a multi-phase

search strategy: the system examines partitions containing high quality documents ﬁrst, and only

backs off to partitions containing lower quality documents if necessary. This reduces the number of

servers that need to be contacted for a user query. Along an orthogonal dimension, it is desirable

to partition documents by content (perhaps also guided by the distribution of user queries from

logs), so that each partition is “well separated” from the others in terms of topical coverage. This

also reduces the number of machines that need to be involved in serving a user’s query: the broker

can direct queries only to the partitions that are likely to contain relevant documents, as opposed to

forwarding the user query to all the partitions.

On a large scale, reliability of service is provided by replication, both in terms of multiple machines serving the same partition within a single datacenter and in terms of replication across geographically distributed datacenters. This creates at least two query routing problems: since it makes sense

to serve clients from the closest datacenter, a service must route queries to the appropriate location.

Within a single datacenter, the system needs to properly balance load across replicas.

There are two ﬁnal components of real-world search engines that are worth discussing. First,

recall that postings only store document ids. Therefore, raw retrieval results consist of a ranked list

of semantically meaningless document ids. It is typically the responsibility of document servers,

functionally distinct from the partition servers holding the indexes, to generate meaningful output

for user presentation. Abstractly, a document server takes as input a query and a document id, and

computes an appropriate result entry, typically comprising the title and URL of the page, a snippet

of the source document showing the user’s query terms in context, and additional metadata about

the document. Second, query evaluation can beneﬁt immensely from caching, of individual postings

(assuming that the index is not already in memory) and even results of entire queries [13]. This is

made possible by the Zipﬁan distribution of queries, with very frequent queries at the head of the

distribution dominating the total number of queries. Search engines take advantage of this with

cache servers, which are functionally distinct from all of the components discussed above.
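The effect of a skewed query distribution on caching can be illustrated with a small simulation. This is only a sketch and not a description of any production cache server: the class name `QueryCache`, the least-recently-used eviction policy, the capacity, and the toy query log are all assumptions made for illustration.

```python
from collections import OrderedDict

class QueryCache:
    """A tiny least-recently-used cache keyed by query string (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, query, evaluate):
        """Return cached results, or call evaluate(query) and cache the answer."""
        if query in self.entries:
            self.entries.move_to_end(query)   # mark as most recently used
            self.hits += 1
            return self.entries[query]
        self.misses += 1
        results = evaluate(query)
        self.entries[query] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return results

# Toy query log with a heavy head: a few queries account for most of the traffic.
query_log = ["weather"] * 6 + ["news"] * 3 + ["rare query a", "rare query b"]

cache = QueryCache(capacity=2)
for q in query_log:
    cache.lookup(q, evaluate=lambda query: ["doc ids for " + query])

print(cache.hits, cache.misses)
```

Even with room for only two entries, most lookups in this toy log are served from the cache, because the frequent queries at the head of the distribution repeat.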

4.7 SUMMARY AND ADDITIONAL READINGS

Web search is a complex problem that breaks down into three conceptually distinct components.

First, the document collection must be gathered (by crawling the web). Next, inverted indexes and

other auxiliary data structures must be built from the documents. Both of these can be considered

ofﬂine problems. Finally, index structures must be accessed and processed in response to user queries

to generate search results. This last task is an online problem that demands both low latency and

high throughput.

This chapter primarily focused on building inverted indexes, the problem most suitable for

MapReduce. After all, inverted indexing is nothing but a very large distributed sort and group by

operation! We began with a baseline implementation of an inverted indexing algorithm, but quickly

noticed a scalability bottleneck that stemmed from having to buffer postings in memory. Application

of the value-to-key conversion design pattern (Section 3.4) addressed the issue by ofﬂoading the

task of sorting postings by document id to the MapReduce execution framework. We also surveyed

various techniques for integer compression, which yield postings lists that are both more compact

and faster to process. As a speciﬁc example, one could use Golomb codes for compressing d-gaps

and γ codes for term frequencies. We showed how the order inversion design pattern introduced in

Section 3.3 for computing relative frequencies can be used to properly set compression parameters.
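As a reminder of what the compression step operates on, the sketch below converts a postings list into d-gaps and encodes term frequencies with γ codes. It assumes the common convention in which the unary code for x consists of x − 1 one bits followed by a zero bit; the function names and the sample postings are hypothetical, and Golomb coding of the d-gaps is left out to keep the example short.

```python
def d_gaps(doc_ids):
    """Convert a sorted list of document ids into gaps between successive ids."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def unary(x):
    """Unary code for x >= 1 (assumed convention): x - 1 one bits, then a zero bit."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Elias gamma code for x >= 1, returned as a string of bits."""
    if x < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    n = x.bit_length() - 1                 # floor(log2 x)
    offset = x - (1 << n)                  # remainder after the leading power of two
    offset_bits = format(offset, "b").zfill(n) if n > 0 else ""
    return unary(n + 1) + offset_bits      # length in unary, offset in n binary bits

# Hypothetical postings: (document id, term frequency) pairs sorted by document id.
postings = [(3, 2), (7, 1), (8, 3), (15, 1)]
gaps = d_gaps([doc_id for doc_id, _ in postings])
encoded_tfs = [gamma(tf) for _, tf in postings]
print(gaps, encoded_tfs)
```

Small values receive short codes, which is why γ codes suit term frequencies: most terms occur only once or twice in a document.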

Additional Readings. Our brief discussion of web search glosses over many complexities and does

a huge injustice to the tremendous amount of research in information retrieval. Here, however, we

provide a few entry points into the literature. A survey article by Zobel and Moffat [162] is an excel-

lent starting point on indexing and retrieval algorithms. Another by Baeza-Yates et al. [11] overviews

many important issues in distributed retrieval. A keynote talk at the WSDM 2009 conference by Jeff

Dean revealed a lot of information about the evolution of the Google search architecture.¹⁵ Finally,

a number of general information retrieval textbooks have been recently published [30; 42; 101]. Of

these three, the one by Büttcher et al. [30] is noteworthy in having detailed experimental evaluations

that compare the performance (both effectiveness and efﬁciency) of a wide range of algorithms and

techniques. While outdated in many other respects, the textbook Managing Gigabytes [156] remains

an excellent source for index compression techniques. Finally, ACM SIGIR is an annual conference

and the most prestigious venue for academic information retrieval research; proceedings from those

events are perhaps the best starting point for those wishing to keep abreast of publicly documented

developments in the ﬁeld.

¹⁵ http://research.google.com/people/jeff/WSDM09-keynote.pdf

CHAPTER 5

Graph Algorithms

Graphs are ubiquitous in modern society: examples encountered by almost everyone on a daily

basis include the hyperlink structure of the web (simply known as the web graph), social networks

(manifest in the ﬂow of email, phone call patterns, connections on social networking sites, etc.), and

transportation networks (roads, bus routes, ﬂights, etc.). Our very own existence is dependent on an

intricate metabolic and regulatory network, which can be characterized as a large, complex graph

involving interactions between genes, proteins, and other cellular products. This chapter focuses

on graph algorithms in MapReduce. Although most of the content has nothing to do with text

processing per se, documents frequently exist in the context of some underlying network, making

graph analysis an important component of many text processing applications. Perhaps the best known

example is PageRank, a measure of web page quality based on the structure of hyperlinks, which is

used in ranking results for web search. As one of the ﬁrst applications of MapReduce, PageRank

exempliﬁes a large class of graph algorithms that can be concisely captured in the programming

model. We will discuss PageRank in detail later in this chapter.

In general, graphs can be characterized by nodes (or vertices) and links (or edges) that connect

pairs of nodes.¹ These connections can be directed or undirected. In some graphs, there may be an

edge from a node to itself, resulting in a self loop; in others, such edges are disallowed. We assume that

both nodes and links may be annotated with additional metadata: as a simple example, in a social

network where nodes represent individuals, there might be demographic information (e.g., age,

gender, location) attached to the nodes and type information attached to the links (e.g., indicating

type of relationship such as “friend” or “spouse”).

Mathematicians have always been fascinated with graphs, dating back to Euler’s paper on the

Seven Bridges of Königsberg in 1736. Over the past few centuries, graphs have been extensively studied,

and today much is known about their properties. Far more than theoretical curiosities, theorems and

algorithms on graphs can be applied to solve many real-world problems:

• Graph search and path planning. Search algorithms on graphs are invoked millions of times a

day, whenever anyone searches for directions on the web. Similar algorithms are also involved

in friend recommendations and expert-ﬁnding in social networks. Path planning problems

involving everything from network packets to delivery trucks represent another large class of

graph search problems.

• Graph clustering. Can a large graph be divided into components that are relatively disjoint (for
example, as measured by inter-component links [59])? Among other applications, this task is
useful for identifying communities in social networks (of interest to sociologists who wish to
understand how human relationships form and evolve) and for partitioning large graphs (of
interest to computer scientists who seek to better parallelize graph processing). See [158] for
a survey.

¹ Throughout this chapter, we use node interchangeably with vertex, and similarly link with edge.

• Minimum spanning trees. A minimum spanning tree for a graph G with weighted edges is

a tree that contains all vertices of the graph and a subset of edges connecting all the vertices

together that minimizes the sum of edge weights. A real-world example of this problem is

a telecommunications company that wishes to lay optical fiber to span a number of destinations

at the lowest possible cost (where weights denote costs). This approach has also been

applied to a wide variety of problems, including social networks and the migration of Polynesian

islanders [64].

• Bipartite graph matching. A bipartite graph is one whose vertices can be divided into two

disjoint sets such that every edge connects a vertex in one set to a vertex in the other. Matching

problems on such graphs can be used to model job seekers looking for

employment or singles looking for dates.

• Maximum ﬂow. In a weighted directed graph with two special nodes called the source and

the sink, the max ﬂow problem involves computing the amount of “trafﬁc” that can be sent

from source to sink given various ﬂow capacities deﬁned by edge weights. Transportation

companies (airlines, shipping, etc.) and network operators grapple with complex versions of

these problems on a daily basis.

• Identifying “special” nodes. There are many ways to deﬁne what special means, including

metrics based on node in-degree, average distance to other nodes, and relationship to cluster

structure. These special nodes are important to investigators attempting to break up terrorist

cells, epidemiologists modeling the spread of diseases, advertisers trying to promote products,

and many others.

A common feature of these problems is the scale of the datasets on which the algorithms must

operate: for example, the hyperlink structure of the web, which contains billions of pages, or social

networks that contain hundreds of millions of individuals. Clearly, algorithms that run on a single

machine and depend on the entire graph residing in memory are not scalable. We’d like to put

MapReduce to work on these challenges.²

This chapter is organized as follows: we begin in Section 5.1 with an introduction to graph

representations, and then explore two classic graph algorithms in MapReduce: parallel breadth-ﬁrst

search (Section 5.2) and PageRank (Section 5.3). Before concluding with a summary and pointing

out additional readings, Section 5.4 discusses a number of general issues that affect graph processing

with MapReduce.

² As a side note, Google recently published a short description of a system called Pregel [98], based on Valiant’s Bulk Synchronous
Parallel model [148], for large-scale graph algorithms; a longer description is anticipated in a forthcoming paper [99].

[Figure 5.1 graphic: a directed graph with five nodes, $n_1$ through $n_5$.]

adjacency matrix                              adjacency lists

        $n_1$  $n_2$  $n_3$  $n_4$  $n_5$
$n_1$     0      1      0      1      0       $n_1$: [$n_2$, $n_4$]
$n_2$     0      0      1      0      1       $n_2$: [$n_3$, $n_5$]
$n_3$     0      0      0      1      0       $n_3$: [$n_4$]
$n_4$     0      0      0      0      1       $n_4$: [$n_5$]
$n_5$     1      1      1      0      0       $n_5$: [$n_1$, $n_2$, $n_3$]

Figure 5.1: A simple directed graph (left) represented as an adjacency matrix (middle) and with adjacency
lists (right).

5.1 GRAPH REPRESENTATIONS

One common way to represent a graph is with an adjacency matrix. A graph with $n$ nodes can be
represented as an $n \times n$ square matrix $M$, where a value in cell $m_{ij}$ indicates an edge from node $n_i$ to
node $n_j$. In the case of graphs with weighted edges, the matrix cells contain edge weights; otherwise,
each cell contains either a one (indicating an edge), or a zero (indicating none). With undirected
graphs, only half the matrix is used (e.g., cells above the diagonal). For graphs that allow self loops
(a directed edge from a node to itself), the diagonal might be populated; otherwise, the diagonal
remains empty. Figure 5.1 provides an example of a simple directed graph (left) and its adjacency
matrix representation (middle).
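The adjacency matrix of the Figure 5.1 graph can be built in a few lines. This is a minimal sketch: the edge list is transcribed from the figure, nodes are numbered 1 through 5, and the helper names `out_links` and `in_links` are our own.

```python
# Edges of the example graph in Figure 5.1, as (source, destination) pairs.
edges = [(1, 2), (1, 4), (2, 3), (2, 5), (3, 4), (4, 5), (5, 1), (5, 2), (5, 3)]
num_nodes = 5

# matrix[i][j] == 1 means there is an edge from node i+1 to node j+1.
matrix = [[0] * num_nodes for _ in range(num_nodes)]
for src, dst in edges:
    matrix[src - 1][dst - 1] = 1

def out_links(matrix, node):
    """Row scan: nodes reachable from `node` via its outgoing edges."""
    return [j + 1 for j, cell in enumerate(matrix[node - 1]) if cell]

def in_links(matrix, node):
    """Column scan: nodes that have an edge pointing at `node`."""
    return [i + 1 for i, row in enumerate(matrix) if row[node - 1]]

for row in matrix:
    print(row)
```

Storing the matrix as a list of rows makes outgoing links a row read and incoming links a column read, at the cost of one cell for every possible edge.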

Although mathematicians prefer the adjacency matrix representation of graphs for easy manipulation
with linear algebra, such a representation is far from ideal for computer scientists concerned
with efficient algorithmic implementations. Most of the applications discussed in the chapter
introduction involve sparse graphs, where the number of actual edges is far smaller than the number
of possible edges.³ For example, in a social network of $n$ individuals, there are $n(n-1)/2$ possible
“friendships” (where $n$ may be on the order of hundreds of millions). However, even the most
gregarious will have relatively few friends compared to the size of the network (thousands, perhaps, but
still far smaller than hundreds of millions). The same is true for the hyperlink structure of the web:
each individual web page links to a minuscule portion of all the pages on the web. In this chapter,
we assume processing of sparse graphs, although we will return to this issue in Section 5.4.

The major problem with an adjacency matrix representation for sparse graphs is its $O(n^2)$ space
requirement. Furthermore, most of the cells are zero, by definition. As a result, most computational
implementations of graph algorithms operate over adjacency lists, in which a node is associated
with neighbors that can be reached via outgoing edges. Figure 5.1 also shows the adjacency list
representation of the graph under consideration (on the right). For example, since $n_1$ is connected
by directed edges to $n_2$ and $n_4$, those two nodes will be on the adjacency list of $n_1$. There are two
options for encoding undirected graphs: one could simply encode each edge twice (if $n_i$ and $n_j$ are
connected, each appears on the other’s adjacency list). Alternatively, one could order the nodes
(arbitrarily or otherwise) and encode edges only on the adjacency list of the node that comes first in
the ordering (i.e., if $i < j$, then $n_j$ is on the adjacency list of $n_i$, but not the other way around).

³ Unfortunately, there is no precise definition of sparseness agreed upon by all, but one common definition is that a sparse graph
has $O(n)$ edges, where $n$ is the number of vertices.
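The adjacency list encodings described above can be sketched as follows. The function names are hypothetical, nodes are identified by integers, and the directed edge list is transcribed from Figure 5.1.

```python
from collections import defaultdict

def directed_adjacency_lists(edges):
    """Each node maps to the nodes reachable via its outgoing edges."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    return dict(adjacency)

def undirected_both_ways(edges):
    """Option 1: encode each undirected edge twice, once per endpoint."""
    adjacency = defaultdict(list)
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return dict(adjacency)

def undirected_ordered(edges):
    """Option 2: encode each undirected edge once, on the smaller node id's list."""
    adjacency = defaultdict(list)
    for a, b in edges:
        first, second = min(a, b), max(a, b)
        adjacency[first].append(second)
    return dict(adjacency)

figure_edges = [(1, 2), (1, 4), (2, 3), (2, 5), (3, 4), (4, 5), (5, 1), (5, 2), (5, 3)]
lists = directed_adjacency_lists(figure_edges)
print(lists)
```

The first undirected encoding doubles the storage but lets every node see all of its neighbors locally; the second halves the storage but requires looking at other nodes' lists to find the remaining neighbors.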

Note that certain graph operations are easier on adjacency matrices than on adjacency lists.

With a matrix, operations on incoming links for each node translate into a column scan on the matrix,

whereas operations on outgoing links for each node translate into a row scan. With adjacency lists,

it is natural to operate on outgoing links, but computing anything that requires knowledge of the

incoming links of a node is difﬁcult. However, as we shall see, the shufﬂe and sort mechanism in

MapReduce provides an easy way to group edges by their destination nodes, thus allowing us to

compute over incoming edges in the reducer. This property of the execution framework can also be

used to invert the edges of a directed graph, by mapping over the nodes’ adjacency lists and emitting

key–value pairs with the destination node id as the key and the source node id as the value.⁴
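Edge inversion via shuffle and sort can be simulated on a single machine. In the sketch below, `shuffle_and_sort` is a stand-in for the grouping that the execution framework performs; all function names are our own and no real MapReduce API is used.

```python
from collections import defaultdict

def map_invert(node_id, adjacency_list):
    """Mapper: for each outgoing edge, emit (destination, source)."""
    for destination in adjacency_list:
        yield destination, node_id

def shuffle_and_sort(pairs):
    """Stand-in for the framework: group values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_invert(node_id, sources):
    """Reducer: the grouped sources form the node's list of incoming edges."""
    return node_id, sorted(sources)

# Adjacency lists of the Figure 5.1 graph, with nodes numbered 1 through 5.
graph = {1: [2, 4], 2: [3, 5], 3: [4], 4: [5], 5: [1, 2, 3]}

intermediate = [pair for n, adj in graph.items() for pair in map_invert(n, adj)]
inverted = dict(reduce_invert(k, v) for k, v in shuffle_and_sort(intermediate))
print(inverted)
```

Each reducer call sees exactly the incoming edges of one node, which is the information that is awkward to obtain from adjacency lists alone.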

5.2 PARALLEL BREADTH-FIRST SEARCH

One of the most common and well-studied problems in graph theory is the single-source shortest path

problem, where the task is to ﬁnd shortest paths from a source node to all other nodes in the graph

(or alternatively, edges can be associated with costs or weights, in which case the task is to compute

lowest-cost or lowest-weight paths). Such problems are a staple in undergraduate algorithm courses,

where students are taught the solution using Dijkstra’s algorithm. However, this famous algorithm

assumes sequential processing—how would we solve this problem in parallel, and more speciﬁcally,

with MapReduce?

As a refresher and also to serve as a point of comparison, Dijkstra’s algorithm is shown in

Figure 5.2, adapted from Cormen, Leiserson, and Rivest’s classic algorithms textbook [41] (often

simply known as CLR). The input to the algorithm is a directed, connected graph G = (V, E)

represented with adjacency lists, w containing edge distances such that w(u, v) ≥ 0, and the source

node s. The algorithm begins by ﬁrst setting distances to all vertices d[v], v ∈ V to ∞, except for

the source node, whose distance to itself is zero. The algorithm maintains Q, a global priority queue

of vertices with priorities equal to their distance values d.

⁴ This technique is used in anchor text inversion, where one gathers the anchor text of hyperlinks pointing to a particular page. It is
common practice to enrich a web page’s standard textual representation with all of the anchor texts associated with its incoming
hyperlinks (e.g., [107]).

Dijkstra’s algorithm operates by iteratively selecting the node with the lowest current distance
from the priority queue (initially, this is the source node). At each iteration, the algorithm “expands”
that node by traversing its adjacency list to see if any of those nodes can be
reached with a path of a shorter distance. The algorithm terminates when the priority queue Q is
empty, or equivalently, when all nodes have been considered. Note that the algorithm as presented
in Figure 5.2 only computes the shortest distances. The actual paths can be recovered by storing
“backpointers” for every node indicating a fragment of the shortest path.

Dijkstra(G, w, s)
    for all vertex v ∈ V do
        d[v] ← ∞
    d[s] ← 0
    Q ← {V}
    while Q ≠ ∅ do
        u ← ExtractMin(Q)
        for all vertex v ∈ u.AdjacencyList do
            if d[v] > d[u] + w(u, v) then
                d[v] ← d[u] + w(u, v)

Figure 5.2: Pseudo-code for Dijkstra’s algorithm, which is based on maintaining a global priority queue
of nodes with priorities equal to their distances from the source node. At each iteration, the algorithm
expands the node with the shortest distance and updates distances to all reachable nodes.

A sample trace of the algorithm running on a simple graph is shown in Figure 5.3 (example
also adapted from CLR). We start out in (a) with $n_1$ having a distance of zero (since it’s the source)
and all other nodes having a distance of ∞. In the first iteration (a), $n_1$ is selected as the node to
expand (indicated by the thicker border). After the expansion, we see in (b) that $n_2$ and $n_3$ can be
reached at a distance of 10 and 5, respectively. Also, we see in (b) that $n_3$ is the next node selected
for expansion. Nodes we have already considered for expansion are shown in black. Expanding $n_3$,
we see in (c) that the distance to $n_2$ has decreased because we’ve found a shorter path. The nodes
that will be expanded next, in order, are $n_5$, $n_2$, and $n_4$. The algorithm terminates with the end state
shown in (f), where we’ve discovered the shortest distance to all nodes.
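The trace can be reproduced with a short sequential implementation. This is a sketch under stated assumptions: it uses Python's `heapq` with lazy deletion of stale entries in place of a decrease-key operation, it stores backpointers so that paths can be recovered, and the edge directions and distances in `graph` follow the CLR example from which the figure is adapted (they are our reading of the example, not something stated in the text).

```python
import heapq
import math

def dijkstra(graph, source):
    """graph maps each node to a list of (neighbor, edge distance) pairs."""
    dist = {node: math.inf for node in graph}
    back = {node: None for node in graph}      # backpointers for path recovery
    dist[source] = 0
    queue = [(0, source)]
    expanded = []                              # order in which nodes are expanded
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist[u]:
            continue                           # stale entry: u was already reached more cheaply
        expanded.append(u)
        for v, w in graph[u]:
            if dist[v] > d + w:
                dist[v] = d + w
                back[v] = u
                heapq.heappush(queue, (dist[v], v))
    return dist, back, expanded

def path_to(back, target):
    """Follow backpointers from target to the source and reverse the result."""
    path = []
    while target is not None:
        path.append(target)
        target = back[target]
    return path[::-1]

# Assumed edges (CLR example): node -> [(neighbor, distance), ...]
graph = {
    "n1": [("n2", 10), ("n3", 5)],
    "n2": [("n3", 2), ("n4", 1)],
    "n3": [("n2", 3), ("n4", 9), ("n5", 2)],
    "n4": [("n5", 4)],
    "n5": [("n1", 7), ("n4", 6)],
}

dist, back, expanded = dijkstra(graph, "n1")
print(dist, expanded, path_to(back, "n4"))
```

Under these assumed edges, the expansion order matches the one described in the trace, and the backpointers yield the path to each node without a second pass over the graph.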

The key to Dijkstra’s algorithm is the priority queue that maintains a globally sorted list of

nodes by current distance. This is not possible in MapReduce, as the programming model does not

provide a mechanism for exchanging global data. Instead, we adopt a brute force approach known as

parallel breadth-ﬁrst search. First, as a simpliﬁcation let us assume that all edges have unit distance

(modeling, for example, hyperlinks on the web). This makes the algorithm easier to understand, but

we’ll relax this restriction later.

The intuition behind the algorithm is this: the distance of all nodes connected directly to the

source node is one; the distance of all nodes directly connected to those is two; and so on. Imagine

water rippling away from a rock dropped into a pond—that’s a good image of how parallel breadth-

ﬁrst search works. However, what if there are multiple paths to the same node? Suppose we wish to

compute the shortest distance to node $n$. Let $M$ be the set of nodes that have an outgoing edge to $n$. The shortest path
must go through one of the nodes in $M$: we need to examine all $m \in M$ to find $m_s$, the node with the
shortest distance. The shortest distance to $n$ is the distance to $m_s$ plus one.
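The same statement can be written as a recurrence. The notation $d(\cdot)$ for the shortest distance from the source $s$ and $E$ for the edge set is ours, introduced only for this restatement; unit edge distances are assumed, as in the text.

```latex
d(s) = 0, \qquad
d(n) = 1 + \min_{m \in M} d(m) \quad \text{for } n \neq s,
\qquad \text{where } M = \{\, m : (m, n) \in E \,\}
```

For instance, if $n$ has incoming edges from nodes at distances 3, 5, and 4, then $d(n) = 1 + 3 = 4$.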

[Figure 5.3 graphic: six panels, (a) through (f), each showing the five-node graph with the current distance
inside each node. The distances shown in the panels are:]

panel    $n_1$   $n_2$   $n_3$   $n_4$   $n_5$
(a)        0       ∞       ∞       ∞       ∞
(b)        0      10       5       ∞       ∞
(c)        0       8       5      14       7
(d)        0       8       5      13       7
(e)        0       8       5       9       7
(f)        0       8       5       9       7

Figure 5.3: Example of Dijkstra’s algorithm applied to a simple graph with five nodes, with $n_1$ as the
source and edge distances as indicated. Parts (a)–(e) show the running of the algorithm at each iteration,
with the current distance inside the node. Nodes with thicker borders are those being expanded; nodes
that have already been expanded are shown in black.

Pseudo-code for the implementation of the parallel breadth-ﬁrst search algorithm is provided

in Figure 5.4. As with Dijkstra’s algorithm, we assume a connected, directed graph represented as

adjacency lists. Distance to each node is directly stored alongside the adjacency list of that node, and

initialized to ∞ for all nodes except for the source node. In the pseudo-code, we use n to denote

the node id (an integer) and N to denote the node’s corresponding data structure (adjacency list and

current distance). The algorithm works by mapping over all nodes and emitting a key-value pair for

each neighbor on the node’s adjacency list. The key contains the node id of the neighbor, and the

value is the current distance to the node plus one. This says: if we can reach node $n$ with a distance

$d$, then we must be able to reach all the nodes that are connected to $n$ with distance $d + 1$. After

shufﬂe and sort, reducers will receive keys corresponding to the destination node ids and distances

corresponding to all paths leading to that node. The reducer will select the shortest of these distances

and then update the distance in the node data structure.

 1: class Mapper
 2:   method Map(nid n, node N)
 3:     d ← N.Distance
 4:     Emit(nid n, N)                              ▷ Pass along graph structure
 5:     for all nodeid m ∈ N.AdjacencyList do
 6:       Emit(nid m, d + 1)                        ▷ Emit distances to reachable nodes

 1: class Reducer
 2:   method Reduce(nid m, [d_1, d_2, . . .])
 3:     d_min ← ∞
 4:     M ← ∅
 5:     for all d ∈ [d_1, d_2, . . .] do
 6:       if IsNode(d) then
 7:         M ← d                                   ▷ Recover graph structure
 8:       else if d < d_min then                    ▷ Look for shorter distance
 9:         d_min ← d
10:     M.Distance ← Min(M.Distance, d_min)         ▷ Update shortest distance
11:     Emit(nid m, node M)

Figure 5.4: Pseudo-code for parallel breadth-first search in MapReduce: the mappers emit distances to
reachable nodes, while the reducers select the minimum of those distances for each destination node.
Each iteration (one MapReduce job) of the algorithm expands the “search frontier” by one hop.

It is apparent that parallel breadth-first search is an iterative algorithm, where each iteration
corresponds to a MapReduce job. The first time we run the algorithm, we “discover” all nodes that
are connected to the source. The second iteration, we discover all nodes connected to those, and
so on. Each iteration of the algorithm expands the “search frontier” by one hop, and, eventually, all
nodes will be discovered with their shortest distances (assuming a connected graph). Before
we discuss termination of the algorithm, there is one more detail required to make the parallel
breadth-first search algorithm work. We need to “pass along” the graph structure from one iteration
to the next. This is accomplished by emitting the node data structure itself, with the node id as a key
(Figure 5.4, line 4 in the mapper). In the reducer, we must distinguish the node data structure from
distance values (Figure 5.4, lines 5–6 in the reducer), and update the minimum distance in the node
data structure before emitting it as the final value. The final output is now ready to serve as input to
the next iteration.⁵
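One iteration of the algorithm can be simulated on a single machine as follows. This is a sketch, not a Hadoop program: the tagged tuples `("node", ...)` and `("distance", ...)` stand in for a wrapper object that tells the two kinds of values apart, the dictionary layout of a node is our own, and the reducer keeps the node's stored distance when no shorter one arrives so that the source is never overwritten. Every node id that appears on an adjacency list is assumed to have its own record in `graph`.

```python
import math
from collections import defaultdict

def bfs_map(nid, node):
    """Mapper: pass the node along, then emit a distance for each neighbor."""
    yield nid, ("node", node)
    for m in node["adjacency"]:
        yield m, ("distance", node["distance"] + 1)

def bfs_reduce(nid, values):
    """Reducer: recover the node structure and keep the shortest distance seen."""
    node, d_min = None, math.inf
    for tag, payload in values:
        if tag == "node":
            node = payload                 # recover graph structure
        elif payload < d_min:
            d_min = payload                # look for a shorter distance
    return nid, dict(node, distance=min(node["distance"], d_min))

def run_iteration(graph):
    """One simulated MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for nid, node in graph.items():
        for key, value in bfs_map(nid, node):
            groups[key].append(value)
    return dict(bfs_reduce(nid, values) for nid, values in groups.items())

# Figure 5.1 graph with node 1 as the source; all other distances start at infinity.
adjacency = {1: [2, 4], 2: [3, 5], 3: [4], 4: [5], 5: [1, 2, 3]}
graph = {n: {"adjacency": adj, "distance": 0 if n == 1 else math.inf}
         for n, adj in adjacency.items()}

after_one = run_iteration(graph)
after_two = run_iteration(after_one)
print({n: node["distance"] for n, node in after_two.items()})
```

The output of `run_iteration` has the same shape as its input, which is what allows the result of one job to be fed directly into the next.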

So how many iterations are necessary to compute the shortest distance to all nodes? The
answer is at most the diameter of the graph, or the greatest distance between any pair of nodes. This number
is surprisingly small for many real-world problems: the saying “six degrees of separation” suggests
that everyone on the planet is connected to everyone else by at most six steps (the people a person
knows are one step away, people that they know are two steps away, etc.). If this is indeed true,
then parallel breadth-first search on the global social network would take at most six MapReduce
iterations. For more serious academic studies of “small world” phenomena in networks, we refer the
reader to a number of publications [2; 61; 62; 152]. In practical terms, we iterate the algorithm until
there are no more node distances that are ∞. Since the graph is connected, all nodes are reachable,
and since all edge distances are one, all discovered nodes are guaranteed to have the shortest distances
(i.e., there is not a shorter path that goes through a node that hasn’t been discovered).

⁵ Note that in this algorithm we are overloading the value type, which can either be a distance (integer) or a complex data structure
representing a node. The best way to achieve this in Hadoop is to create a wrapper object with an indicator variable specifying
what the content is.

The actual checking of the termination condition must occur outside of MapReduce. Typically,

execution of an iterative MapReduce algorithm requires a non-MapReduce “driver” program, which

submits a MapReduce job to iterate the algorithm, checks to see if a termination condition has been

met, and if not, repeats. Hadoop provides a lightweight API for constructs called “counters”, which,

as the name suggests, can be used for counting events that occur during execution, e.g., number

of corrupt records, number of times a certain condition is met, or anything that the programmer

desires. Counters can be deﬁned to count the number of nodes that have distances of ∞: at the end

of the job, the driver program can access the ﬁnal counter value and check to see if another iteration

is necessary.
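The interaction between the driver and a counter can be sketched as follows. Nothing here uses the Hadoop API: `run_job` simulates a whole MapReduce job in memory, the `counters` dictionary stands in for a framework counter, and the `max_iterations` guard is our own addition so that the loop also ends on a graph with unreachable nodes.

```python
import math
from collections import defaultdict

def run_job(graph):
    """Simulate one MapReduce job; return the new graph and a counter dict."""
    groups = defaultdict(list)
    for nid, node in graph.items():                        # map phase
        groups[nid].append(("node", node))
        for m in node["adjacency"]:
            groups[m].append(("distance", node["distance"] + 1))
    new_graph, counters = {}, {"unreached": 0}
    for nid, values in groups.items():                     # reduce phase
        node = next(p for tag, p in values if tag == "node")
        arrivals = [p for tag, p in values if tag == "distance"]
        best = min([node["distance"]] + arrivals)
        new_graph[nid] = dict(node, distance=best)
        if best == math.inf:
            counters["unreached"] += 1                     # node still has distance infinity
    return new_graph, counters

def driver(graph, max_iterations=100):
    """Non-MapReduce driver: resubmit the job until no distance is infinite."""
    iterations = 0
    while iterations < max_iterations:
        graph, counters = run_job(graph)
        iterations += 1
        if counters["unreached"] == 0:
            break
    return graph, iterations

# Figure 5.1 graph with node 1 as the source.
adjacency = {1: [2, 4], 2: [3, 5], 3: [4], 4: [5], 5: [1, 2, 3]}
start = {n: {"adjacency": adj, "distance": 0 if n == 1 else math.inf}
         for n, adj in adjacency.items()}

final, iterations = driver(start)
print(iterations, {n: node["distance"] for n, node in final.items()})
```

The driver never inspects the graph itself; it only reads the counter after each job, which mirrors the division of labor described above.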

Finally, as with Dijkstra’s algorithm in the form presented earlier, the parallel breadth-ﬁrst

search algorithm only ﬁnds the shortest distances, not the actual shortest paths. However, the path

can be straightforwardly recovered. Storing “backpointers” at each node, as with Dijkstra’s algorithm,

will work, but may not be efﬁcient since the graph needs to be traversed again to reconstruct the

path segments. A simpler approach is to emit paths along with distances in the mapper, so that each

node will have its shortest path easily accessible at all times. The additional space requirements for

shufﬂing these data from mappers to reducers are relatively modest, since for the most part paths

(i.e., sequence of node ids) are relatively short.

Up until now, we have been assuming that all edges are unit distance. Let us relax that

restriction and see what changes are required in the parallel breadth-ﬁrst search algorithm. The

adjacency lists, which were previously lists of node ids, must now encode the edge distances as well.

In line 6 of the mapper code in Figure 5.4, instead of emitting $d + 1$ as the value, we must now

emit $d + w$ where $w$ is the edge distance. No other changes to the algorithm are required, but

the termination behavior is very different. To illustrate, consider the graph fragment in Figure 5.5,

where s is the source node, and in this iteration, we just “discovered” node r for the very ﬁrst time.

Assume for the sake of argument that we’ve already discovered the shortest distance to node p,

and that the shortest distance to r so far goes through p. This, however, does not guarantee that

we’ve discovered the shortest distance to r, since there may exist a path going through q that we

haven’t encountered yet (because it lies outside the search frontier).⁶ However, as the search frontier
expands, we’ll eventually cover q and all other nodes along the path from p to q to r, which means
that with sufficient iterations, we will discover the shortest distance to r. But how do we know that
we’ve found the shortest distance to p? Well, if the shortest path to p lies within the search frontier,
we would have already discovered it. And if it doesn’t, the above argument applies. Similarly, we
can repeat the same argument for all nodes on the path from s to p. The conclusion is that, with
sufficient iterations, we’ll eventually discover all the shortest distances.

⁶ Note that the same argument does not apply to the unit edge distance case: the shortest path cannot lie outside the search frontier
since any such path would necessarily be longer.

[Figure 5.5 graphic: a graph fragment showing the source s, nodes p, q, and r, and the boundary of the search frontier.]

Figure 5.5: In the single-source shortest path problem with arbitrary edge distances, the shortest path
from source s to node r may go outside the current search frontier, in which case we will not find the
shortest distance to r until the search frontier expands to cover q.

[Figure 5.6 graphic: a graph of nine nodes, $n_1$ through $n_9$, in which eight edges have distance 1 and one edge has distance 10.]

Figure 5.6: A sample graph that elicits worst-case behavior for parallel breadth-first search. Eight
iterations are required to discover shortest distances to all nodes from $n_1$.

So exactly how many iterations does “eventually” mean? In the worst case, we might need as
many iterations as there are nodes in the graph minus one. In fact, it is not difficult to construct
graphs that will elicit this worst-case behavior: Figure 5.6 provides an example, with $n_1$ as the source.
The parallel breadth-first search algorithm would not discover that the shortest path from $n_1$ to $n_6$
goes through $n_3$, $n_4$, and $n_5$ until the fifth iteration. Three more iterations are necessary to cover
the rest of the graph. Fortunately, for most real-world graphs, such extreme cases are rare, and the
number of iterations necessary to discover all shortest distances is quite close to the diameter of the
graph, as in the unit edge distance case.

In practical terms, how do we know when to stop iterating in the case of arbitrary edge

distances? The algorithm can terminate when shortest distances at every node no longer change.

Once again, we can use counters to keep track of such events. Every time we encounter a shorter

distance in the reducer, we increment a counter. At the end of each MapReduce iteration, the driver

program reads the counter value and determines if another iteration is necessary.
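The termination test for arbitrary edge distances can be sketched in the same single-machine style. The graph built by `chain_with_shortcut` is an assumption: a chain of unit edges from node 1 to node 9 plus a direct edge of distance 10 from node 1 to node 6, which is our reading of Figure 5.6 rather than a transcription of it. The `updates` value plays the role of the counter incremented in the reducer.

```python
import math
from collections import defaultdict

def run_weighted_job(graph):
    """One simulated job; returns the new graph and the number of nodes updated."""
    groups = defaultdict(list)
    for nid, node in graph.items():                        # map phase
        groups[nid].append(("node", node))
        for m, w in node["adjacency"]:
            groups[m].append(("distance", node["distance"] + w))
    new_graph, updates = {}, 0
    for nid, values in groups.items():                     # reduce phase
        node = next(p for tag, p in values if tag == "node")
        arrivals = [p for tag, p in values if tag == "distance"]
        best = min([node["distance"]] + arrivals)
        if best < node["distance"]:
            updates += 1                                   # a shorter distance was found
        new_graph[nid] = dict(node, distance=best)
    return new_graph, updates

def shortest_distances(graph):
    """Driver: iterate until a job finishes without changing any distance."""
    jobs = 0
    while True:
        graph, updates = run_weighted_job(graph)
        jobs += 1
        if updates == 0:
            return graph, jobs

def chain_with_shortcut():
    """Assumed worst-case graph: unit chain 1 -> 2 -> ... -> 9 plus edge 1 -> 6 of distance 10."""
    graph = {i: {"adjacency": [], "distance": math.inf} for i in range(1, 10)}
    for i in range(1, 9):
        graph[i]["adjacency"].append((i + 1, 1))
    graph[1]["adjacency"].append((6, 10))
    graph[1]["distance"] = 0
    return graph

final, jobs = shortest_distances(chain_with_shortcut())
print(jobs, {n: node["distance"] for n, node in final.items()})
```

On this assumed graph, eight jobs change at least one distance and a ninth job is needed only to confirm that nothing changes any more, which is the price of detecting convergence with a counter.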

Compared to Dijkstra’s algorithm on a single processor, parallel breadth-ﬁrst search in Map-

Reduce can be characterized as a brute force approach that “wastes” a lot of time performing compu-

tations whose results are discarded. At each iteration, the algorithm attempts to recompute distances

to all nodes, but in reality only useful work is done along the search frontier: inside the search frontier,

the algorithm is simply repeating previous computations.⁷ Outside the search frontier, the algorithm

hasn’t discovered any paths to nodes there yet, so no meaningful work is done. Dijkstra’s algorithm, on

the other hand, is far more efﬁcient. Every time a node is explored, we’re guaranteed to have already

found the shortest path to it. However, this is made possible by maintaining a global data structure

(a priority queue) that holds nodes sorted by distance—this is not possible in MapReduce because

the programming model does not provide support for global data that is mutable and accessible by

the mappers and reducers. These inefﬁciencies represent the cost of parallelization.

The parallel breadth-ﬁrst search algorithm is instructive in that it represents the prototypical

structure of a large class of graph algorithms in MapReduce. They share the following characteristics:

• The graph structure is represented with adjacency lists, which are part of some larger node

data structure that may contain additional information (variables to store intermediate output,

features of the nodes). In many cases, features are attached to edges as well (e.g., edge weights).

• The MapReduce algorithm maps over the node data structures and performs a computation

that is a function of features of the node, intermediate state attached to each node, and features

of the adjacency list (outgoing edges and their features). In other words, computations can only

involve a node’s internal state and its local graph structure. The results of these computations are

emitted as values, keyed with the node ids of the neighbors (i.e., those nodes on the adjacency

lists). Conceptually, we can think of this as “passing” the results of the computation along

outgoing edges. In the reducer, the algorithm receives all partial results that have the same

destination node, and performs another computation (usually, some form of aggregation).

• In addition to computations, the graph itself is also passed from the mapper to the reducer.

In the reducer, the data structure corresponding to each node is updated and written back to

disk.

• Graph algorithms in MapReduce are generally iterative, where the output of the previous

iteration serves as input to the next iteration. The process is controlled by a non-MapReduce

driver program that checks for termination.

⁷ Unless the algorithm discovers an instance of the situation described in Figure 5.5, in which case updated distances will propagate inside the search frontier.


For parallel breadth-ﬁrst search, the mapper computation is the current distance plus edge distance

(emitting distances to neighbors), while the reducer computation is the Min function (selecting the

shortest path). As we will see in the next section, the MapReduce algorithm for PageRank works in

much the same way.
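The mapper and reducer computations for parallel breadth-first search can be sketched in a few lines of Python. This is a minimal single-process simulation under stated assumptions, not Hadoop code: the function names (`bfs_map`, `bfs_reduce`, `run_iteration`, `shortest_paths`), the dictionary-based node structure, and the in-memory grouping that stands in for shuffle and sort are all hypothetical. The reducer here also compares against the node's current distance, which is one possible way to keep already-discovered distances.

```python
import math
from collections import defaultdict

def bfs_map(nid, node):
    """Emit the node itself plus a tentative distance for each neighbor."""
    yield nid, node                              # pass along graph structure
    if node["dist"] < math.inf:
        for neighbor, weight in node["adj"].items():
            yield neighbor, node["dist"] + weight  # current distance plus edge distance

def bfs_reduce(nid, values):
    """Recover the node structure and keep the minimum distance seen."""
    node, best = None, math.inf
    for v in values:
        if isinstance(v, dict):
            node = v                             # the node data structure
        else:
            best = min(best, v)                  # the Min function
    best = min(best, node["dist"])               # assumption: keep the known distance
    return nid, {"dist": best, "adj": node["adj"]}

def run_iteration(graph):
    """One simulated MapReduce job."""
    groups = defaultdict(list)                   # stands in for shuffle and sort
    for nid, node in graph.items():
        for key, value in bfs_map(nid, node):
            groups[key].append(value)
    return dict(bfs_reduce(nid, vals) for nid, vals in groups.items())

def shortest_paths(graph):
    """Driver: iterate until no distance changes."""
    while True:
        updated = run_iteration(graph)
        if all(updated[n]["dist"] == graph[n]["dist"] for n in graph):
            return updated
        graph = updated
```

The driver loop plays the role of the non-MapReduce driver program: it resubmits the job until an iteration leaves every distance unchanged.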

5.3 PAGERANK

PageRank [117] is a measure of web page quality based on the structure of the hyperlink graph.

Although it is only one of thousands of features that are taken into account in Google’s search

algorithm, it is perhaps one of the best known and most studied.

A vivid way to illustrate PageRank is to imagine a random web surfer: the surfer visits a

page, randomly clicks a link on that page, and repeats ad inﬁnitum. PageRank is a measure of how

frequently a page would be encountered by our tireless web surfer. More precisely, PageRank is a

probability distribution over nodes in the graph representing the likelihood that a random walk over

the link structure will arrive at a particular node. Nodes that have high in-degrees tend to have high

PageRank values, as do nodes that are linked to by other nodes with high PageRank values.

This behavior makes intuitive sense: if PageRank is a measure of page quality, we would expect

high-quality pages to contain “endorsements” from many other pages in the form of hyperlinks.

Similarly, if a high-quality page links to another page, then the second page is likely to be high

quality also. PageRank represents one particular approach to inferring the quality of a web page

based on hyperlink structure; two other popular algorithms, not covered here, are SALSA [88] and

HITS [84] (also known as “hubs and authorities”).

The complete formulation of PageRank includes an additional component. As it turns out,

our web surfer doesn’t just randomly click links. Before the surfer decides where to go next, a biased

coin is ﬂipped—heads, the surfer clicks on a random link on the page as usual. Tails, however, the

surfer ignores the links on the page and randomly “jumps” or “teleports” to a completely different

page.

But enough about random web surﬁng. Formally, the PageRank P of a page n is deﬁned as

follows:

$$P(n) = \alpha \left( \frac{1}{|G|} \right) + (1 - \alpha) \sum_{m \in L(n)} \frac{P(m)}{C(m)} \tag{5.1}$$

where |G| is the total number of nodes (pages) in the graph, α is the random jump factor, L(n) is

the set of pages that link to n, and C(m) is the out-degree of node m (the number of links on page

m). The random jump factor α is sometimes called the “teleportation” factor; alternatively, (1 −α)

is referred to as the “damping” factor.

Let us break down each component of the formula in detail. First, note that PageRank is

deﬁned recursively—this gives rise to an iterative algorithm we will detail in a bit. A web page n

receives PageRank “contributions” from all pages that link to it, L(n). Let us consider a page m from

the set of pages L(n): a random surfer at m will arrive at n with probability 1/C(m) since a link is


selected at random from all outgoing links. Since the PageRank value of m is the probability that the

random surfer will be at m, the probability of arriving at n from m is P(m)/C(m). To compute the

PageRank of n, we need to sum contributions from all pages that link to n. This is the summation

in the second half of the equation. However, we also need to take into account the random jump:

there is a 1/|G| chance of landing at any particular page, where |G| is the number of nodes in the

graph. Of course, the two contributions need to be combined: with probability α the random surfer

executes a random jump, and with probability 1 −α the random surfer follows a hyperlink.
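To make the two terms of Equation 5.1 concrete, the following sketch applies the formula once to every node of a small in-memory graph. It is an illustration only: the function name `pagerank_update` and the dictionary representation are assumptions, and the code assumes every node has at least one outgoing link.

```python
def pagerank_update(graph, ranks, alpha):
    """Apply Equation 5.1 once to every node.

    graph maps each node to the list of nodes it links to;
    ranks maps each node to its current PageRank value.
    Assumes every node has at least one outgoing link.
    """
    size = len(graph)                            # |G|
    incoming = {n: [] for n in graph}            # L(n): pages that link to n
    for m, targets in graph.items():
        for n in targets:
            incoming[n].append(m)
    return {
        n: alpha * (1.0 / size)                  # random jump term
        + (1.0 - alpha) * sum(ranks[m] / len(graph[m]) for m in incoming[n])
        for n in graph
    }
```

Because the jump term contributes α in total and the link term contributes (1 − α) times the existing mass, the updated values again sum to one when the input values do.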

Note that PageRank assumes a community of honest users who are not trying to “game” the

measure. This is, of course, not true in the real world, where an adversarial relationship exists between

search engine companies and a host of other organizations and individuals (marketers, spammers,

activists, etc.) who are trying to manipulate search results—to promote a cause, product, or service, or

in some cases, to trap and intentionally deceive users (see, for example, [12; 63]). A simple example

is a so-called “spider trap”, an infinite chain of pages (e.g., generated by CGI) that all link to a single

page (thereby artiﬁcially inﬂating its PageRank). For this reason, PageRank is only one of thousands

of features used in ranking web pages.

The fact that PageRank is recursively deﬁned translates into an iterative algorithm which is

quite similar in basic structure to parallel breadth-ﬁrst search. We start by presenting an informal

sketch. At the beginning of each iteration, a node passes its PageRank contributions to other nodes

that it is connected to. Since PageRank is a probability distribution, we can think of this as spreading

probability mass to neighbors via outgoing links. To conclude the iteration, each node sums up all

PageRank contributions that have been passed to it and computes an updated PageRank score. We

can think of this as gathering probability mass passed to a node via its incoming links. This algorithm

iterates until PageRank values don’t change anymore.

Figure 5.7 shows a toy example that illustrates two iterations of the algorithm. As a simpliﬁ-

cation, we ignore the random jump factor for now (i.e., α = 0) and further assume that there are no

dangling nodes (i.e., nodes with no outgoing edges). The algorithm begins by initializing a uniform

distribution of PageRank values across nodes. In the beginning of the ﬁrst iteration (top, left), partial

PageRank contributions are sent from each node to its neighbors connected via outgoing links. For

example, n_1 sends 0.1 PageRank mass to n_2 and 0.1 PageRank mass to n_4. This makes sense in

terms of the random surfer model: if the surfer is at n_1 with a probability of 0.2, then the surfer could

end up either in n_2 or n_4 with a probability of 0.1 each. The same occurs for all the other nodes in

the graph: note that n_5 must split its PageRank mass three ways, since it has three neighbors, and

n_4 receives all the mass belonging to n_3 because n_3 isn’t connected to any other node. The end of

the first iteration is shown in the top right: each node sums up PageRank contributions from its

neighbors. Note that since n_1 has only one incoming link, from n_5, its updated PageRank value is

smaller than before, i.e., it “passed along” more PageRank mass than it received. The exact same

process repeats, and the second iteration in our toy example is illustrated by the bottom two graphs.

At the beginning of each iteration, the PageRank values of all nodes sum to one. PageRank mass is

preserved by the algorithm, guaranteeing that we continue to have a valid probability distribution

at the end of each iteration.

[Figure 5.7 graphic omitted. Node values: start of iteration 1, all five nodes 0.2; end of iteration 1 (start of iteration 2), n_1 = 0.066, n_2 = 0.166, n_3 = 0.166, n_4 = 0.3, n_5 = 0.3; end of iteration 2, n_1 = 0.1, n_2 = 0.133, n_3 = 0.183, n_4 = 0.2, n_5 = 0.383.]

Figure 5.7: PageRank toy example showing two iterations, top and bottom. Left graphs show PageRank values at the beginning of each iteration and how much PageRank mass is passed to each neighbor. Right graphs show updated PageRank values at the end of each iteration.

Pseudo-code of the MapReduce PageRank algorithm is shown in Figure 5.8; it is simpliﬁed

in that we continue to ignore the random jump factor and assume no dangling nodes (complications

that we will return to later). An illustration of the running algorithm is shown in Figure 5.9 for the

ﬁrst iteration of the toy graph in Figure 5.7. The algorithm maps over the nodes, and for each node

computes how much PageRank mass needs to be distributed to its neighbors (i.e., nodes on the

adjacency list). Each piece of the PageRank mass is emitted as the value, keyed by the node ids of

the neighbors. Conceptually, we can think of this as passing PageRank mass along outgoing edges.

In the shuffle and sort phase, the MapReduce execution framework groups values (pieces of

PageRank mass) passed along the graph edges by destination node (i.e., all edges that point to the

same node). In the reducer, PageRank mass contributions from all incoming edges are summed

to arrive at the updated PageRank value for each node. As with the parallel breadth-ﬁrst search

algorithm, the graph structure itself must be passed from iteration to iteration. Each node data

structure is emitted in the mapper and written back out to disk in the reducer. All PageRank mass

emitted by the mappers is accounted for in the reducer: since we begin with the sum of PageRank

values across all nodes equal to one, the updated PageRank values again sum to one and therefore

remain a valid probability distribution.

class Mapper
    method Map(nid n, node N)
        p ← N.PageRank / |N.AdjacencyList|
        Emit(nid n, N)                            ▷ Pass along graph structure
        for all nodeid m ∈ N.AdjacencyList do
            Emit(nid m, p)                        ▷ Pass PageRank mass to neighbors

class Reducer
    method Reduce(nid m, [p_1, p_2, . . .])
        M ← ∅
        s ← 0
        for all p ∈ [p_1, p_2, . . .] do
            if IsNode(p) then
                M ← p                             ▷ Recover graph structure
            else
                s ← s + p                         ▷ Sum incoming PageRank contributions
        M.PageRank ← s
        Emit(nid m, node M)

Figure 5.8: Pseudo-code for PageRank in MapReduce (leaving aside dangling nodes and the random jump factor). In the map phase we evenly divide up each node’s PageRank mass and pass each piece along outgoing edges to neighbors. In the reduce phase PageRank contributions are summed up at each destination node. Each MapReduce job corresponds to one iteration of the algorithm.
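The simplified algorithm of Figure 5.8 can be tried out with a small single-process simulation. The sketch below is not Hadoop code: the function names, the dictionary node structure, and the in-memory grouping that stands in for shuffle and sort are assumptions made for illustration. The toy graph uses the adjacency lists shown in Figure 5.9.

```python
from collections import defaultdict

def pagerank_map(nid, node):
    yield nid, node                                  # pass along graph structure
    share = node["rank"] / len(node["adj"])
    for m in node["adj"]:
        yield m, share                               # pass PageRank mass to neighbors

def pagerank_reduce(nid, values):
    node, total = None, 0.0
    for v in values:
        if isinstance(v, dict):
            node = v                                 # recover graph structure
        else:
            total += v                               # sum incoming contributions
    return nid, {"rank": total, "adj": node["adj"]}

def pagerank_iteration(graph):
    """One simulated MapReduce job (no random jump, no dangling nodes)."""
    groups = defaultdict(list)                       # stands in for shuffle and sort
    for nid, node in graph.items():
        for key, value in pagerank_map(nid, node):
            groups[key].append(value)
    return dict(pagerank_reduce(nid, vals) for nid, vals in groups.items())

toy = {
    "n1": {"rank": 0.2, "adj": ["n2", "n4"]},
    "n2": {"rank": 0.2, "adj": ["n3", "n5"]},
    "n3": {"rank": 0.2, "adj": ["n4"]},
    "n4": {"rank": 0.2, "adj": ["n5"]},
    "n5": {"rank": 0.2, "adj": ["n1", "n2", "n3"]},
}
after_one = pagerank_iteration(toy)
after_two = pagerank_iteration(after_one)
```

Up to rounding, `after_one` and `after_two` hold the values reported for the ends of the first and second iterations in Figure 5.7, and the values sum to one after each pass.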

Having discussed the simpliﬁed PageRank algorithm in MapReduce, let us now take into

account the random jump factor and dangling nodes: as it turns out both are treated similarly.

Dangling nodes are nodes in the graph that have no outgoing edges, i.e., their adjacency lists are

empty. In the hyperlink graph of the web, these might correspond to pages in a crawl that have not

been downloaded yet. If we simply run the algorithm in Figure 5.8 on graphs with dangling nodes,

the total PageRank mass will not be conserved, since no key-value pairs will be emitted when a

dangling node is encountered in the mappers.

The proper treatment of PageRank mass “lost” at the dangling nodes is to redistribute it across

all nodes in the graph evenly (cf. [22]). There are many ways to determine the missing PageRank

mass. One simple approach is to instrument the algorithm in Figure 5.8 with counters: whenever

the mapper processes a node with an empty adjacency list, it keeps track of the node’s PageRank

value in the counter. At the end of the iteration, we can access the counter to find out how much

PageRank mass was lost at the dangling nodes.⁸ Another approach is to reserve a special key for

storing PageRank mass from dangling nodes. When the mapper encounters a dangling node, its

PageRank mass is emitted with the special key; the reducer must be modified to contain special logic

for handling the missing PageRank mass. Yet another approach is to write out the missing PageRank

mass as “side data” for each map task (using the in-mapper combining technique for aggregation); a

final pass in the driver program is needed to sum the mass across all map tasks. Either way, we arrive

at the amount of PageRank mass lost at the dangling nodes; this must then be redistributed evenly

across all nodes.

[Figure 5.9 graphic omitted. Adjacency lists shown at the top of the figure: n_1 [n_2, n_4], n_2 [n_3, n_5], n_3 [n_4], n_4 [n_5], n_5 [n_1, n_2, n_3].]

Figure 5.9: Illustration of the MapReduce PageRank algorithm corresponding to the first iteration in Figure 5.7. The size of each box is proportional to its PageRank value. During the map phase, PageRank mass is distributed evenly to nodes on each node’s adjacency list (shown at the very top). Intermediate values are keyed by node (shown inside the boxes). In the reduce phase, all partial PageRank contributions are summed together to arrive at updated values.

This redistribution process can be accomplished by mapping over all nodes again. At the same

time, we can take into account the random jump factor. For each node, its current PageRank value

p is updated to the final PageRank value p′ according to the following formula:

$$p' = \alpha \left( \frac{1}{|G|} \right) + (1 - \alpha) \left( \frac{m}{|G|} + p \right) \tag{5.2}$$

where m is the missing PageRank mass, and |G| is the number of nodes in the entire graph. We add

the PageRank mass from link traversal (p, computed from before) to the share of the lost PageRank

mass that is distributed to each node (m/|G|). Finally, we take into account the random jump factor:

with probability α the random surfer arrives via jumping, and with probability 1 − α the random

surfer arrives via incoming links. Note that this MapReduce job requires no reducers.

⁸ In Hadoop, counters are 8-byte integers: a simple workaround is to multiply PageRank values by a large constant, and then cast as an integer.

Putting everything together, one iteration of PageRank requires two MapReduce jobs: the

ﬁrst to distribute PageRank mass along graph edges, and the second to take care of dangling nodes

and the random jump factor. At the end of each iteration, we end up with exactly the same data

structure as the beginning, which is a requirement for the iterative algorithm to work. Also, the

PageRank values of all nodes sum up to one, which ensures a valid probability distribution.
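The two jobs that make up one full iteration can be sketched as two plain Python functions. This is a single-process illustration under stated assumptions: the names `distribute`, `redistribute`, and `full_iteration` are hypothetical, and the local variable `missing` plays the role that a counter, special key, or side data would play in a real job.

```python
def distribute(graph, ranks):
    """First job: pass mass along edges; return received mass and lost mass."""
    received = {n: 0.0 for n in graph}
    missing = 0.0
    for n, adj in graph.items():
        if not adj:
            missing += ranks[n]              # dangling node: tally the lost mass
            continue
        for m in adj:
            received[m] += ranks[n] / len(adj)
    return received, missing

def redistribute(ranks, missing, alpha):
    """Second job (map-only): apply Equation 5.2 to every node."""
    size = len(ranks)                        # |G|
    return {
        n: alpha * (1.0 / size) + (1.0 - alpha) * (missing / size + p)
        for n, p in ranks.items()
    }

def full_iteration(graph, ranks, alpha):
    received, missing = distribute(graph, ranks)
    return redistribute(received, missing, alpha)
```

With a dangling node present, the output of `distribute` alone sums to less than one; after `redistribute` the values sum to one again.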

Typically, PageRank is iterated until convergence, i.e., when the PageRank values of nodes

no longer change (within some tolerance, to take into account, for example, ﬂoating point precision

errors). Therefore, at the end of each iteration, the PageRank driver program must check to see

if convergence has been reached. Alternative stopping criteria include running a ﬁxed number of

iterations (useful if one wishes to bound algorithm running time) or stopping when the ranks of

PageRank values no longer change. The latter is useful for some applications that only care about

comparing the PageRank of two arbitrary pages and do not need the actual PageRank values. Rank

stability is obtained faster than the actual convergence of values.
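The stopping criteria described above can be expressed as small predicates that a driver program evaluates after each iteration. The sketch below is illustrative only: the function names, the default tolerance, and the iteration cap are assumptions, and `step` stands for whatever runs one full PageRank iteration.

```python
def has_converged(old, new, tolerance=1e-6):
    """True when no PageRank value moved by more than `tolerance`."""
    return all(abs(new[n] - old[n]) <= tolerance for n in old)

def ranking(ranks):
    """Node ids ordered from highest to lowest PageRank (ties broken by id)."""
    return sorted(ranks, key=lambda n: (-ranks[n], n))

def same_ranking(old, new):
    """True when the relative order of nodes did not change."""
    return ranking(old) == ranking(new)

def iterate(step, ranks, stop, max_iterations=100):
    """Driver loop: apply `step` until `stop(old, new)` holds or the cap is hit."""
    for i in range(1, max_iterations + 1):
        updated = step(ranks)
        if stop(ranks, updated):
            return updated, i
        ranks = updated
    return ranks, max_iterations
```

Passing `same_ranking` as the `stop` argument gives the rank-stability criterion, while `has_converged` gives value convergence within a tolerance; `max_iterations` bounds the running time in either case.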

In absolute terms, how many iterations are necessary for PageRank to converge? This is a

difﬁcult question to precisely answer since it depends on many factors, but generally, fewer than one

might expect. In the original PageRank paper [117], convergence on a graph with 322 million edges

was reached in 52 iterations (see also Bianchini et al. [22] for additional discussion). On today’s

web, the answer is not very meaningful due to the adversarial nature of web search as previously

discussed—the web is full of spam and populated with sites that are actively trying to “game”

PageRank and related hyperlink-based metrics. As a result, running PageRank in its unmodiﬁed

form presented here would yield unexpected and undesirable results. Of course, strategies developed

by web search companies to combat link spam are proprietary (and closely guarded secrets, for

obvious reasons)—but undoubtedly these algorithmic modiﬁcations impact convergence behavior.

A full discussion of the escalating “arms race” between search engine companies and those that seek

to promote their sites is beyond the scope of this book.⁹

⁹ For the interested reader, the proceedings of a workshop series on Adversarial Information Retrieval (AIRWeb) provide great starting points into the literature.

5.4 ISSUES WITH GRAPH PROCESSING

The biggest difference between MapReduce graph algorithms and single-machine graph algorithms

is that with the latter, it is usually possible to maintain global data structures in memory for fast,

random access. For example, Dijkstra’s algorithm uses a global priority queue that guides the expansion

of nodes. This, of course, is not possible with MapReduce: the programming model does not

provide any built-in mechanism for communicating global state. Since the most natural representation

of large sparse graphs is with adjacency lists, communication can only occur from a node to the

nodes it links to, or to a node from nodes linked to it; in other words, passing information is only

possible within the local graph structure.¹⁰

This restriction gives rise to the structure of many graph algorithms in MapReduce: local

computation is performed on each node, the results of which are “passed” to its neighbors. With

multiple iterations, convergence on the global graph is possible. The passing of partial results along

a graph edge is accomplished by the shufﬂing and sorting provided by the MapReduce execution

framework. The amount of intermediate data generated is on the order of the number of edges,

which explains why all the algorithms we have discussed assume sparse graphs. For dense graphs,

MapReduce running time would be dominated by copying intermediate data across the network,

which in the worst case is O(n^2) in the number of nodes in the graph. Since MapReduce clusters

are designed around commodity networks (e.g., gigabit Ethernet), MapReduce algorithms are often

impractical on large, dense graphs.

Combiners and the in-mapper combining pattern described in Section 3.1 can be used to

decrease the running time of graph iterations. It is straightforward to use combiners for both parallel

breadth-ﬁrst search and PageRank since Min and sum, used in the two algorithms, respectively, are

both associative and commutative. However, combiners are only effective to the extent that there

are opportunities for partial aggregation—unless there are nodes pointed to by multiple nodes being

processed by an individual map task, combiners are not very useful. This implies that it would be

desirable to partition large graphs into smaller components where there are many intra-component

links and fewer inter-component links. This way, we can arrange the data such that nodes in the

same component are handled by the same map task—thus maximizing opportunities for combiners

to perform local aggregation.
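The effect of local aggregation can be seen in a small sketch of in-mapper combining for PageRank. The function name and the tuple layout are assumptions for illustration, and passing along the graph structure is left out to keep the example short.

```python
from collections import defaultdict

def map_task_with_combining(nodes):
    """Process one input split, summing mass per destination before emitting.

    `nodes` is a list of (nid, rank, adjacency_list) tuples handled by a
    single map task.
    """
    partial = defaultdict(float)
    for nid, rank, adj in nodes:
        for m in adj:
            partial[m] += rank / len(adj)    # aggregate locally instead of emitting
    return dict(partial)                     # one pair per distinct destination

split = [("n1", 0.2, ["n2", "n4"]), ("n5", 0.2, ["n1", "n2", "n3"])]
emitted = map_task_with_combining(split)
```

Here the split contains five outgoing edges but only four distinct destinations, so four intermediate pairs are produced instead of five. The saving grows with the number of edges in a split that share a destination, which is why placing densely interlinked nodes in the same map task helps.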

Unfortunately, this sometimes creates a chicken-and-egg problem. It would be desirable to

partition a large graph to facilitate efﬁcient processing by MapReduce. But the graph may be so

large that we can’t partition it except with MapReduce algorithms! Fortunately, in many cases there

are simple solutions around this problem in the form of “cheap” partitioning heuristics based on

reordering the data [106]. For example, in a social network, we might sort nodes representing users

by zip code, as opposed to by last name—based on the observation that friends tend to live close

to each other. Sorting by an even more cohesive property such as school would be even better (if

available): the probability of any two random students from the same school knowing each other is

much higher than two random students from different schools. Another good example is to partition

the web graph by the language of the page (since pages in one language tend to link mostly to other

pages in that language) or by domain name (since intra-domain links are typically much denser than

inter-domain links). Re-sorting records using MapReduce is both easy to do and a relatively cheap

operation—however, whether the efﬁciencies gained by this crude form of partitioning are worth

the extra time taken in performing the resort is an empirical question that will depend on the actual

graph structure and algorithm.

¹⁰ Of course, it is perfectly reasonable to compute derived graph structures in a pre-processing step. For example, if one wishes to propagate information from a node to all nodes that are within two links, one could process graph G to derive graph G′, where there would exist a link from node n_i to n_j if n_j was reachable within two link traversals of n_i in the original graph G.


Finally, there is a practical consideration to keep in mind when implementing graph algorithms

that estimate probability distributions over nodes (such as PageRank). For large graphs, the probabil-

ity of any particular node is often so small that it underﬂows standard ﬂoating point representations.

A very common solution to this problem is to represent probabilities using their logarithms. When

probabilities are stored as logs, the product of two values is simply their sum. However, addition of

probabilities is also necessary, for example, when summing PageRank contributions for a node. This

can be implemented with reasonable precision as follows:

$$a \oplus b = \begin{cases} b + \log(1 + e^{a-b}) & a < b \\ a + \log(1 + e^{b-a}) & a \geq b \end{cases}$$

Furthermore, many math libraries include a log1p function which computes log(1 +x) with higher

precision than the naïve implementation would have when x is very small (as is often the case when

working with probabilities). Its use may further improve the accuracy of implementations that use

log probabilities.
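A minimal Python sketch of this log-space addition, using the standard library's `math.log1p`, might look as follows. The function name `log_add` and the explicit handling of negative infinity (the log of a zero probability) are assumptions added for illustration.

```python
import math

def log_add(a, b):
    """Return log(exp(a) + exp(b)) for log-probabilities a and b."""
    if a < b:
        a, b = b, a                          # ensure a >= b so exp(b - a) <= 1
    if b == -math.inf:
        return a                             # adding a zero probability
    return a + math.log1p(math.exp(b - a))
```

Always exponentiating the non-positive difference keeps the intermediate value between 0 and 1, so the computation stays well behaved even when `exp(a)` and `exp(b)` themselves would underflow to zero.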

5.5 SUMMARY AND ADDITIONAL READINGS

This chapter covers graph algorithms in MapReduce, discussing in detail parallel breadth-ﬁrst search

and PageRank. Both are instances of a large class of iterative algorithms that share the following

characteristics:

• The graph structure is represented with adjacency lists.

• Algorithms map over nodes and pass partial results to nodes on their adjacency lists. Partial

results are aggregated for each node in the reducer.

• The graph structure itself is passed from the mapper to the reducer, such that the output is in

the same form as the input.

• Algorithms are iterative and under the control of a non-MapReduce driver program, which

checks for termination at the end of each iteration.

The MapReduce programming model does not provide a mechanism to maintain global data structures

accessible and mutable by all the mappers and reducers.¹¹ One implication of this is that

communication between pairs of arbitrary nodes is difﬁcult to accomplish. Instead, information

typically propagates along graph edges—which gives rise to the structure of algorithms discussed

above.

Additional Readings. The ubiquity of large graphs translates into substantial interest in scalable

graph algorithms using MapReduce in industry, academia, and beyond. There is, of course, much

beyond what has been covered in this chapter. For additional material, we refer readers to the

following: Kang et al. [80] presented an approach to estimating the diameter of large graphs using

MapReduce and a library for graph mining [81]; Cohen [39] discussed a number of algorithms

for processing undirected graphs, with social network analysis in mind; Rao and Yarowsky [128]

described an implementation of label propagation, a standard algorithm for semi-supervised machine

learning, on graphs derived from textual data; Schatz [132] tackled the problem of DNA sequence

alignment and assembly with graph algorithms in MapReduce. Finally, it is easy to forget that

parallel graph algorithms have been studied by computer scientists for several decades, particularly in

the PRAM model [60; 77]. It is not clear, however, to what extent well-known PRAM algorithms

translate naturally into the MapReduce framework.

¹¹ However, maintaining globally-synchronized state may be possible with the assistance of other tools (e.g., a distributed database).

CHAPTER 6

EM Algorithms for Text Processing

Until the end of the 1980s, text processing systems tended to rely on large numbers of manually

written rules to analyze, annotate, and transform text input, usually in a deterministic way. This

rule-based approach can be appealing: a system’s behavior can generally be understood and predicted

precisely, and, when errors surface, they can be corrected by writing new rules or reﬁning old ones.

However, rule-based systems suffer from a number of serious problems. They are brittle with respect

to the natural variation found in language, and developing systems that can deal with inputs from

diverse domains is very labor intensive. Furthermore, when these systems fail, they often do so

catastrophically, unable to offer even a “best guess” as to what the desired analysis of the input might

be.

In the last 20 years, the rule-based approach has largely been abandoned in favor of more

data-driven methods, where the “rules” for processing the input are inferred automatically from

large corpora of examples, called training data. The basic strategy of the data-driven approach is

to start with a processing algorithm capable of capturing how any instance of the kinds of inputs

(e.g., sentences or emails) can relate to any instance of the kinds of outputs that the ﬁnal system

should produce (e.g., the syntactic structure of the sentence or a classiﬁcation of the email as spam).

At this stage, the system can be thought of as having the potential to produce any output for any

input, but they are not distinguished in any way. Next, a learning algorithm is applied which reﬁnes

this process based on the training data—generally attempting to make the model perform as well as

possible at predicting the examples in the training data. The learning process, which often involves

iterative algorithms, typically consists of activities like ranking rules, instantiating the content of rule

templates, or determining parameter settings for a given model. This is known as machine learning,

an active area of research.

Data-driven approaches have turned out to have several beneﬁts over rule-based approaches

to system development. Since data-driven systems can be trained using examples of the kind that

they will eventually be used to process, they tend to deal with the complexities found in real data

more robustly than rule-based systems do. Second, developing training data tends to be far less

expensive than developing rules. For some applications, signiﬁcant quantities of training data may

even exist for independent reasons (e.g., translations of text into multiple languages are created by

authors wishing to reach an audience speaking different languages, not because they are generating

training data for a data-driven machine translation system). These advantages come at the cost of


systems that often behave internally quite differently than a human-engineered system. As a result,

correcting errors that the trained system makes can be quite challenging.

Data-driven information processing systems can be constructed using a variety of mathe-

matical techniques, but in this chapter we focus on statistical models, which probabilistically relate

inputs from an input set X (e.g., sentences, documents, etc.), which are always observable, to anno-

tations from a set Y, which is the space of possible annotations or analyses that the system should

predict. This model may take the form of either a joint model Pr(x, y) which assigns a probability

to every pair ⟨x, y⟩ ∈ X × Y or a conditional model Pr(y|x), which assigns a probability to every

y ∈ Y, given an x ∈ X. For example, to create a statistical spam detection system, we might have

Y = {Spam,NotSpam} and X be the set of all possible email messages. For machine translation,

X might be the set of Arabic sentences and Y the set of English sentences.¹

There are three closely related, but distinct, challenges in statistical text processing. The ﬁrst

is model selection. This entails selecting a representation of a joint or conditional distribution over

the desired X and Y. For a problem where X and Y are very small, one could imagine representing

these probabilities in lookup tables. However, for something like email classiﬁcation or machine

translation, where the model space is inﬁnite, the probabilities cannot be represented directly, and

must be computed algorithmically. As an example of such models, we introduce hidden Markov

models (HMMs), which deﬁne a joint distribution over sequences of inputs and sequences of anno-

tations. The second challenge is parameter estimation or learning, which involves the application of

an optimization algorithm and training criterion to select the parameters of the model to optimize

the model’s performance (with respect to the given training criterion) on the training data.² The

parameters of a statistical model are the values used to compute the probability of some event de-

scribed by the model. In this chapter we will focus on one particularly simple training criterion for

parameter estimation, maximum likelihood estimation, which says to select the parameters that make

the training data most probable under the model, and one learning algorithm that attempts to meet

this criterion, called expectation maximization (EM). The ﬁnal challenge for statistical modeling

is the problem of decoding, or, given some x, using the model to select an annotation y. One very

common strategy is to select y according to the following criterion:

$$y^* = \arg\max_{y \in Y} \Pr(y \mid x)$$

In a conditional (or direct) model, this is a straightforward search for the best y under the model. In a

joint model, the search is also straightforward, on account of the deﬁnition of conditional probability:

$$y^* = \arg\max_{y \in Y} \Pr(y \mid x) = \arg\max_{y \in Y} \frac{\Pr(x, y)}{\sum_{y'} \Pr(x, y')} = \arg\max_{y \in Y} \Pr(x, y)$$

The specific form that the search takes will depend on how the model is represented. Our focus in

this chapter will primarily be on the second problem: learning parameters for models, but we will

touch on the third problem as well.

¹ In this chapter, we will consider discrete models only. They tend to be sufficient for text processing, and their presentation is simpler than models with continuous densities. It should be kept in mind that the sets X and Y may still be countably infinite.

² We restrict our discussion in this chapter to models with finite numbers of parameters and where the learning process refers to setting those parameters. Inference in and learning of so-called nonparametric models, which have an infinite number of parameters and have become important statistical models for text processing in recent years, is beyond the scope of this chapter.
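For a joint model small enough to store in a lookup table, the decoding criterion above reduces to a one-line search. The sketch below is a toy illustration: the function names, the table layout, and the probabilities in `joint` are invented for the example and are not taken from any real model.

```python
def decode(joint, x, labels):
    """Pick the label y that maximizes Pr(x, y) for the observed x."""
    return max(labels, key=lambda y: joint.get((x, y), 0.0))

def conditional(joint, x, labels):
    """Pr(y | x), obtained by normalizing the joint over all labels."""
    total = sum(joint.get((x, y), 0.0) for y in labels)
    return {y: joint.get((x, y), 0.0) / total for y in labels}

labels = ["Spam", "NotSpam"]
joint = {                                    # hypothetical joint probabilities
    ("free money", "Spam"): 0.08,
    ("free money", "NotSpam"): 0.02,
    ("meeting at noon", "Spam"): 0.05,
    ("meeting at noon", "NotSpam"): 0.85,
}
best = decode(joint, "free money", labels)
posterior = conditional(joint, "free money", labels)
```

Since the normalizer is the same for every candidate label, the label that maximizes the joint probability also maximizes the conditional probability, so `decode` never needs to compute the sum.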

Machine learning is often categorized as either supervised or unsupervised. Supervised learning

of statistical models simply means that the model parameters are estimated from training data

consisting of pairs of inputs and annotations, that is Z = ⟨x_1, y_1⟩, ⟨x_2, y_2⟩, . . . where ⟨x_i, y_i⟩ ∈ X × Y

and y_i is the gold standard (i.e., correct) annotation of x_i. While supervised models often

attain quite good performance, they are often uneconomical to use, since the training data requires

each object that is to be classified (to pick a specific task), x_i, to be paired with its correct label, y_i. In

many cases, these gold standard training labels must be generated by a process of expert annotation,

meaning that each x_i must be manually labeled by a trained individual. Even when the annotation

task is quite simple for people to carry out (e.g., in the case of spam detection), the number of potential

examples that could be classiﬁed (representing a subset of X, which may of course be inﬁnite in

size) will far exceed the amount of data that can be annotated. As the annotation task becomes more

complicated (e.g., when predicting more complex structures such as sequences of labels or when the

annotation task requires specialized expertise), annotation becomes far more challenging.

Unsupervised learning, on the other hand, requires only that the training data consist of a

representative collection of objects that should be annotated, that is, $Z = \langle x_1, x_2, \ldots \rangle$ where $x_i \in \mathcal{X}$,

but without any example annotations. While it may at ﬁrst seem counterintuitive that meaningful

annotations can be learned without any examples of the desired annotations being given, the learning

criteria and model structure (which crucially deﬁne the space of possible annotations Y and the

process by which annotations relate to observable inputs) make it possible to induce annotations

by relying on regularities in the unclassiﬁed training instances. While a thorough discussion of

unsupervised learning is beyond the scope of this book, we focus on a particular class of algorithms—

expectation maximization (EM) algorithms—that can be used to learn the parameters of a joint

model Pr(x, y) from incomplete data (i.e., data where some of the variables in the model cannot be

observed; in the case of unsupervised learning, the $y_i$'s are unobserved). Expectation maximization

algorithms ﬁt naturally into the MapReduce paradigm, and are used to solve a number of problems

of interest in text processing. Furthermore, these algorithms can be quite computationally expensive,

since they generally require repeated evaluations of the training data. MapReduce therefore provides

an opportunity not only to scale to larger amounts of data, but also to relieve efficiency bottlenecks
at scales where non-parallel solutions could still be utilized.

This chapter is organized as follows. In Section 6.1, we describe maximum likelihood estimation
for statistical models, show how this is generalized to models where not all variables are
observable, and then introduce expectation maximization (EM). We describe hidden Markov models
(HMMs) in Section 6.2, a very versatile class of models that uses EM for parameter estimation.

Section 6.3 discusses how EM algorithms can be expressed in MapReduce, and then in Section 6.4

we look at a case study of word alignment for statistical machine translation. Section 6.5 examines


similar algorithms that are appropriate for supervised learning tasks. This chapter concludes with a

summary and pointers to additional readings.

6.1 EXPECTATION MAXIMIZATION

Expectation maximization (EM) algorithms [49] are a family of iterative optimization algorithms

for learning probability distributions from incomplete data. They are extensively used in statistical

natural language processing where one seeks to infer latent linguistic structure from unannotated text.

To name just a few applications, EM algorithms have been used to ﬁnd part-of-speech sequences,

constituency and dependency trees, alignments between texts in different languages, alignments

between acoustic signals and their transcriptions, as well as for numerous other clustering and

structure discovery problems.

Expectation maximization generalizes the principle of maximum likelihood estimation to the

case where the values of some variables are unobserved (speciﬁcally, those characterizing the latent

structure that is sought).

6.1.1 MAXIMUM LIKELIHOOD ESTIMATION

Maximum likelihood estimation (MLE) is a criterion for ﬁtting the parameters θ of a statistical

model to some given data $\mathbf{x}$. Specifically, it says to select the parameter settings $\theta^*$ such that the
likelihood of observing the training data given the model is maximized:

$$\theta^* = \arg\max_{\theta} \Pr(X = \mathbf{x}; \theta) \qquad (6.1)$$

To illustrate, consider the simple marble game shown in Figure 6.1. In this game, a marble is

released at the position indicated by the black dot, and it bounces down into one of the cups at the

bottom of the board, being diverted to the left or right by the peg (indicated by a triangle) in the

center. Our task is to construct a model that predicts which cup the ball will drop into. A "rule-based"

approach might be to take exact measurements of the board and construct a physical model that we

can use to predict the behavior of the ball. Given sophisticated enough measurements, this could

certainly lead to a very accurate model. However, the construction of this model would be quite time

consuming and difﬁcult.

A statistical approach, on the other hand, might be to assume that the behavior of the marble

in this game can be modeled using a Bernoulli random variable Y with parameter p. That is, the

value of the random variable indicates whether path 0 or 1 is taken. We also define a random variable

X whose value is the label of the cup that the marble ends up in; note that X is deterministically

related to Y, so an observation of X is equivalent to an observation of Y.

To estimate the parameter p of the statistical model of our game, we need some training data,

so we drop 10 marbles into the game which end up in cups $\mathbf{x} = \langle b, b, b, a, b, b, b, b, b, a \rangle$.


Figure 6.1: A simple marble game where a released marble takes one of two possible paths. This game

can be modeled using a Bernoulli random variable with parameter p, which indicates the probability that

the marble will go to the right when it hits the peg.

What is the maximum likelihood estimate of p given this data? By assuming that our samples

are independent and identically distributed (i.i.d.), we can write the likelihood of our data as follows:³

$$\Pr(\mathbf{x}; p) = \prod_{j=1}^{10} p^{\delta(x_j, a)} (1 - p)^{\delta(x_j, b)} = p^2 \cdot (1 - p)^8$$

Since log is a monotonically increasing function, maximizing $\log \Pr(\mathbf{x}; p)$ will give us the desired
result. We can do this by differentiating with respect to $p$ and finding where the resulting expression
equals 0:

$$\frac{d \log \Pr(\mathbf{x}; p)}{dp} = 0$$

$$\frac{d[2 \cdot \log p + 8 \cdot \log(1 - p)]}{dp} = 0$$

$$\frac{2}{p} - \frac{8}{1 - p} = 0$$

Solving for $p$ yields 0.2, which is the intuitive result. Furthermore, it is straightforward to show that
in $N$ trials where $N_0$ marbles followed path 0 to cup $a$, and $N_1$ marbles followed path 1 to cup $b$,
the maximum likelihood estimate of $p$ is $N_1/(N_0 + N_1)$.
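The derivative argument above can be cross-checked numerically. The sketch below, under the assumption that a coarse grid over $(0, 1)$ is an adequate stand-in for exact maximization, evaluates the log of the likelihood $p^2 \cdot (1 - p)^8$ at each grid point and compares the best point with the closed-form ratio. The function names and the grid resolution are illustrative choices, not part of the original presentation.

```python
import math


def log_likelihood(p, n_p, n_q):
    """Log of p**n_p * (1 - p)**n_q, an i.i.d. Bernoulli likelihood."""
    return n_p * math.log(p) + n_q * math.log(1.0 - p)


def mle_closed_form(n_p, n_q):
    """Root of n_p / p - n_q / (1 - p) = 0."""
    return n_p / (n_p + n_q)


# Exponents taken from the likelihood written in the text: p^2 * (1 - p)^8.
n_p, n_q = 2, 8

# Candidate values of p strictly inside (0, 1); the resolution is arbitrary.
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, n_p, n_q))
p_star = mle_closed_form(n_p, n_q)
print(p_grid, p_star)
```

The grid search and the closed form land on the same value, which is what the monotonicity of the logarithm and the single stationary point of the log likelihood lead one to expect.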

While this model only makes use of an approximation of the true physical process at work

when the marble interacts with the game board, it is an empirical question whether the model works

well enough in practice to be useful. Additionally, while a Bernoulli trial is an extreme approximation
of the physical process, if insufficient resources were invested in building a physical model,

³ In this equation, $\delta$ is the Kronecker delta function, which evaluates to 1 where its arguments are equal and 0 otherwise.


Figure 6.2: A more complicated marble game where the released marble takes one of four possible paths.

We assume that we can only observe which cup the marble ends up in, not the speciﬁc path taken.

the approximation may perform better than the more complicated “rule-based” model. This sort

of dynamic is found often in text processing problems: given enough data, astonishingly simple

models can outperform complex knowledge-intensive models that attempt to simulate complicated

processes.

6.1.2 A LATENT VARIABLE MARBLE GAME

To see where latent variables might come into play in modeling, consider a more complicated variant

of our marble game shown in Figure 6.2. This version consists of three pegs that inﬂuence the

marble’s path, and the marble may end up in one of three cups. Note that both paths 1 and 2 lead

to cup b.

To construct a statistical model of this game, we again assume that the behavior of a marble

interacting with a peg can be modeled with a Bernoulli random variable. Since there are three pegs, we
have three random variables with parameters $\theta = \langle p_0, p_1, p_2 \rangle$, corresponding to the probabilities

that the marble will go to the right at the top, left, and right pegs. We further deﬁne a random

variable X taking on values from {a, b, c} indicating what cup the marble ends in, and Y, taking

on values from {0, 1, 2, 3} indicating which path was taken. Note that the full joint distribution

Pr(X = x, Y = y) is determined by θ.

How should the parameters θ be estimated? If it were possible to observe the paths taken by

marbles as they were dropped into the game, it would be trivial to estimate the parameters for our

model using the maximum likelihood estimator—we would simply need to count the number of

times the marble bounced left or right at each peg. If $N_x$ counts the number of times a marble took
path $x$ in $N$ trials, this is:

$$p_0 = \frac{N_2 + N_3}{N} \qquad p_1 = \frac{N_1}{N_0 + N_1} \qquad p_2 = \frac{N_3}{N_2 + N_3}$$
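When paths are observable, the three estimates above reduce to counting. A minimal sketch follows, assuming the path labels 0 through 3 of Figure 6.2; the list of observed paths is invented for illustration, and the function name `estimate_from_paths` is hypothetical.

```python
from collections import Counter


def estimate_from_paths(paths):
    """Maximum likelihood estimates (p0, p1, p2) from fully observed paths."""
    n = Counter(paths)
    total = len(paths)
    p0 = (n[2] + n[3]) / total      # went right at the top peg
    p1 = n[1] / (n[0] + n[1])       # went right at the left peg
    p2 = n[3] / (n[2] + n[3])       # went right at the right peg
    return p0, p1, p2


# Made-up observations of ten marbles: two took path 0, three path 1,
# two path 2, and three path 3.
paths = [0, 1, 1, 2, 3, 3, 1, 0, 2, 3]
print(estimate_from_paths(paths))
```

Note that the sketch divides by the number of marbles reaching each lower peg, so it assumes every peg was visited at least once in the data.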


However, we wish to consider the case where the paths taken are unobservable (imagine an opaque

sheet covering the center of the game board), but where we can see what cup a marble ends in. In

other words, we want to consider the case where we have partial data. This is exactly the problem

encountered in unsupervised learning: there is a statistical model describing the relationship between

two sets of variables (X's and Y's), and there is data available from just one of them. Furthermore, such

algorithms are quite useful in text processing, where latent variables may describe latent linguistic

structures of the observed variables, such as parse trees or part-of-speech tags, or alignment structures

relating sets of observed variables (see Section 6.4).

6.1.3 MLE WITH LATENT VARIABLES

Formally, we consider the problem of estimating parameters for statistical models of the form

Pr(X, Y; θ) which describe not only an observable variable X but a latent, or hidden, variable

Y.

In these models, since only the values of the random variable $X$ are observable, we define our
optimization criterion to be the maximization of the marginal likelihood, that is, summing over all
settings of the latent variable $Y$, which takes on values from a set designated $\mathcal{Y}$:⁴

$$\Pr(X = x) = \sum_{y \in \mathcal{Y}} \Pr(X = x, Y = y; \theta)$$

For a vector of training observations $\mathbf{x} = \langle x_1, x_2, \ldots, x_\ell \rangle$, if we again assume the samples are i.i.d.:

$$\Pr(\mathbf{x}; \theta) = \prod_{j=1}^{|\mathbf{x}|} \sum_{y \in \mathcal{Y}} \Pr(X = x_j, Y = y; \theta)$$

Thus, the maximum (marginal) likelihood estimate of the model parameters $\theta^*$ given a vector of
i.i.d. observations $\mathbf{x}$ becomes:

$$\theta^* = \arg\max_{\theta} \prod_{j=1}^{|\mathbf{x}|} \sum_{y \in \mathcal{Y}} \Pr(X = x_j, Y = y; \theta)$$

Unfortunately, in many cases, this maximum cannot be computed analytically, but the iterative

hill-climbing approach of expectation maximization can be used instead.
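To make the marginal likelihood concrete, the following sketch evaluates it (in log space) for the three-peg marble game of Section 6.1.2, where each cup is reached by one or two hidden paths. The helper names `joint` and `marginal_log_likelihood`, the sample observations, and the parameter values are assumptions made for illustration.

```python
import math


def joint(theta):
    """Pr(X = x, Y = y; theta) for the four paths of the three-peg game."""
    p0, p1, p2 = theta
    return {
        ("a", 0): (1 - p0) * (1 - p1),  # left at top peg, left at left peg
        ("b", 1): (1 - p0) * p1,        # left at top peg, right at left peg
        ("b", 2): p0 * (1 - p2),        # right at top peg, left at right peg
        ("c", 3): p0 * p2,              # right at top peg, right at right peg
    }


def marginal_log_likelihood(cups, theta):
    """Sum over observations j of log sum_y Pr(X = x_j, Y = y; theta)."""
    pr = joint(theta)
    total = 0.0
    for x in cups:
        # Marginalize out the hidden path for the observed cup x.
        total += math.log(sum(v for (cup, _), v in pr.items() if cup == x))
    return total


# Made-up observations and parameters.
print(marginal_log_likelihood(["a", "b", "b", "c"], (0.5, 0.5, 0.5)))
```

Working with the log of the product turns it into a sum, which avoids numerical underflow for long observation vectors while leaving the maximizing parameters unchanged.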

⁴ For this description, we assume that the variables in our model take on discrete values. Not only does this simplify exposition, but
discrete models are widely used in text processing.


6.1.4 EXPECTATION MAXIMIZATION

Expectation maximization (EM) is an iterative algorithm that finds a successive series of parameter
estimates $\theta^{(0)}, \theta^{(1)}, \ldots$ that improve the marginal likelihood of the training data. That is, EM
guarantees:

$$\prod_{j=1}^{|\mathbf{x}|} \sum_{y \in \mathcal{Y}} \Pr(X = x_j, Y = y; \theta^{(i+1)}) \ge \prod_{j=1}^{|\mathbf{x}|} \sum_{y \in \mathcal{Y}} \Pr(X = x_j, Y = y; \theta^{(i)})$$

The algorithm starts with some initial set of parameters $\theta^{(0)}$ and then updates them using two
steps: expectation (E-step), which computes the posterior distribution over the latent variables given
the observable data $\mathbf{x}$ and a set of parameters $\theta^{(i)}$,⁵ and maximization (M-step), which computes
new parameters $\theta^{(i+1)}$ maximizing the expected log likelihood of the joint distribution with respect
to the distribution computed in the E-step. The process then repeats with these new parameters.
The algorithm terminates when the likelihood remains unchanged.⁶ In more detail, the steps are as
follows:

E-step. Compute the posterior probability of each possible hidden variable assignment $y \in \mathcal{Y}$
for each $x \in \mathcal{X}$ and the current parameter settings, weighted by the relative frequency with which $x$
occurs in $\mathbf{x}$. Call this $q(X = x, Y = y; \theta^{(i)})$ and note that it defines a joint probability distribution
over $\mathcal{X} \times \mathcal{Y}$ in that $\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} q(x, y) = 1$.

$$q(x, y; \theta^{(i)}) = f(x \mid \mathbf{x}) \cdot \Pr(Y = y \mid X = x; \theta^{(i)}) = f(x \mid \mathbf{x}) \cdot \frac{\Pr(x, y; \theta^{(i)})}{\sum_{y'} \Pr(x, y'; \theta^{(i)})}$$

M-step. Compute new parameter settings that maximize the expected log of the probability of
the joint distribution under the q-distribution that was computed in the E-step:

$$\begin{aligned}
\theta^{(i+1)} &= \arg\max_{\theta'} \mathbb{E}_{q(X = x, Y = y; \theta^{(i)})} \log \Pr(X = x, Y = y; \theta') \\
&= \arg\max_{\theta'} \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} q(X = x, Y = y; \theta^{(i)}) \cdot \log \Pr(X = x, Y = y; \theta')
\end{aligned}$$

We omit the proof that the model with parameters $\theta^{(i+1)}$ will have equal or greater marginal
likelihood on the training data than the model with parameters $\theta^{(i)}$, but this is provably true [78].

Before continuing, we note that the effective application of expectation maximization requires
that both the E-step and the M-step consist of tractable computations. Specifically, summing

⁵ The term 'expectation' is used since the values computed in terms of the posterior distribution $\Pr(y \mid x; \theta^{(i)})$ that are required to
solve the M-step have the form of an expectation (with respect to this distribution).

⁶ The final solution is only guaranteed to be a local maximum, but if the model is fully convex, it will also be the global maximum.


over the space of hidden variable assignments must not be intractable. Depending on the independence
assumptions made in the model, this may be achieved through techniques such as dynamic
programming. However, some models may require intractable computations.

6.1.5 AN EM EXAMPLE

Let's look at how to estimate the parameters from our latent variable marble game from Section 6.1.2
using EM. We assume training data $\mathbf{x}$ consisting of $N = |\mathbf{x}|$ observations of $X$ with $N_a$, $N_b$, and
$N_c$ indicating the number of marbles ending in cups $a$, $b$, and $c$. We start with some parameters
$\theta^{(0)} = \langle p_0^{(0)}, p_1^{(0)}, p_2^{(0)} \rangle$ that have been randomly initialized to values between 0 and 1.

E-step. We need to compute the distribution $q(X = x, Y = y; \theta^{(i)})$, as defined above. We first
note that the relative frequency $f(x \mid \mathbf{x})$ is:

$$f(x \mid \mathbf{x}) = \frac{N_x}{N}$$

Next, we observe that $\Pr(Y = 0 \mid X = a) = 1$ and $\Pr(Y = 3 \mid X = c) = 1$ since cups $a$ and $c$ fully
determine the value of the path variable $Y$. The posterior probabilities of paths 1 and 2 are only
non-zero when $X$ is $b$:

$$\Pr(1 \mid b; \theta^{(i)}) = \frac{(1 - p_0^{(i)}) \, p_1^{(i)}}{(1 - p_0^{(i)}) \, p_1^{(i)} + p_0^{(i)} (1 - p_2^{(i)})}$$

$$\Pr(2 \mid b; \theta^{(i)}) = \frac{p_0^{(i)} (1 - p_2^{(i)})}{(1 - p_0^{(i)}) \, p_1^{(i)} + p_0^{(i)} (1 - p_2^{(i)})}$$

Except for the four cases just described, Pr(Y = y|X = x) is zero for all other values of x and y

(regardless of the value of the parameters).

M-step. We now need to maximize the expectation of $\log \Pr(X, Y; \theta')$ (which will be a function
in terms of the three parameter variables) under the q-distribution we computed in the E-step. The
non-zero terms in the expectation are as follows:

$$\begin{array}{cccl}
x & y & q(X = x, Y = y; \theta^{(i)}) & \log \Pr(X = x, Y = y; \theta') \\
\hline
a & 0 & N_a/N & \log(1 - p_0) + \log(1 - p_1) \\
b & 1 & N_b/N \cdot \Pr(1 \mid b; \theta^{(i)}) & \log(1 - p_0) + \log p_1 \\
b & 2 & N_b/N \cdot \Pr(2 \mid b; \theta^{(i)}) & \log p_0 + \log(1 - p_2) \\
c & 3 & N_c/N & \log p_0 + \log p_2
\end{array}$$

Multiplying across each row and adding from top to bottom yields the expectation we wish to
maximize. Each parameter can be optimized independently using differentiation. The resulting
optimal values are expressed in terms of the counts in $\mathbf{x}$ and $\theta^{(i)}$:

$$p_0 = \frac{\Pr(2 \mid b; \theta^{(i)}) \cdot N_b + N_c}{N}$$

$$p_1 = \frac{\Pr(1 \mid b; \theta^{(i)}) \cdot N_b}{N_a + \Pr(1 \mid b; \theta^{(i)}) \cdot N_b}$$

$$p_2 = \frac{N_c}{\Pr(2 \mid b; \theta^{(i)}) \cdot N_b + N_c}$$


It is worth noting that the form of these expressions is quite similar to the fully observed maximum

likelihood estimate. However, rather than depending on exact path counts, the statistics used are the
expected path counts, given $\mathbf{x}$ and parameters $\theta^{(i)}$.

Typically, the values computed at the end of the M-step would serve as new parameters for

another iteration of EM. However, the example we have presented here is quite simple and the

model converges to a global optimum after a single iteration. For most models, EM requires several

iterations to converge, and it may not find a global optimum. And since EM only finds a locally
optimal solution, the final parameter values depend on the values chosen for $\theta^{(0)}$.
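The E-step and M-step for the marble game can be written out in a few lines. The sketch below is one possible rendering of the update equations of this section; the cup counts and the starting parameters are made up, and the function names `em_step` and `cup_probabilities` are hypothetical.

```python
def em_step(theta, n_a, n_b, n_c):
    """One EM iteration for the three-peg marble game."""
    p0, p1, p2 = theta
    n = n_a + n_b + n_c
    # E-step: posterior over the two hidden paths that end in cup b.
    w1 = (1 - p0) * p1            # joint probability of path 1
    w2 = p0 * (1 - p2)            # joint probability of path 2
    r1 = w1 / (w1 + w2)           # Pr(1 | b; theta)
    r2 = w2 / (w1 + w2)           # Pr(2 | b; theta)
    # M-step: closed-form updates built from expected path counts.
    new_p0 = (r2 * n_b + n_c) / n
    new_p1 = (r1 * n_b) / (n_a + r1 * n_b)
    new_p2 = n_c / (r2 * n_b + n_c)
    return new_p0, new_p1, new_p2


def cup_probabilities(theta):
    """Marginal Pr(X = a), Pr(X = b), Pr(X = c) implied by theta."""
    p0, p1, p2 = theta
    return ((1 - p0) * (1 - p1),
            (1 - p0) * p1 + p0 * (1 - p2),
            p0 * p2)


# Made-up cup counts and an arbitrary starting point inside (0, 1).
n_a, n_b, n_c = 3, 5, 2
theta1 = em_step((0.3, 0.6, 0.7), n_a, n_b, n_c)
print(cup_probabilities(theta1))
```

After this single step the cup probabilities implied by the new parameters equal the relative frequencies $N_a/N$, $N_b/N$, and $N_c/N$ up to floating-point rounding. This follows algebraically from the update equations, since $(1 - p_0)(1 - p_1)$ simplifies to $N_a/N$ and $p_0 p_2$ simplifies to $N_c/N$, and it illustrates the single-iteration convergence described above. Different starting points generally yield different parameter values that fit the observed cups equally well.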

6.2 HIDDEN MARKOV MODELS

To give a more substantial and useful example of models whose parameters may be estimated using

EM, we turn to hidden Markov models (HMMs). HMMs are models of data that are ordered

sequentially (temporally, from left to right, etc.), such as words in a sentence, base pairs in a gene,

or letters in a word. These simple but powerful models have been used in applications as diverse as

speech recognition [78], information extraction [139], gene ﬁnding [143], part-of-speech tagging

[44], stock market forecasting [70], text retrieval [108], and word alignment of parallel (translated)

texts [150] (more in Section 6.4).

In an HMM, the data being modeled is posited to have been generated from an underlying

Markov process, which is a stochastic process consisting of a ﬁnite set of states where the probability

of entering a state at time t +1 depends only on the state of the process at time t [130]. Alternatively,

one can view a Markov process as a probabilistic variant of a ﬁnite state machine, where transitions

are taken probabilistically. As another point of comparison, the PageRank algorithm considered

in the previous chapter (Section 5.3) can be understood as a Markov process: the probability of

following any link on a particular page is independent of the path taken to reach that page. The

states of this Markov process are, however, not directly observable (i.e., hidden). Instead, at each

time step, an observable token (e.g., a word, base pair, or letter) is emitted according to a probability

distribution conditioned on the identity of the state that the underlying process is in.

A hidden Markov model $\mathcal{M}$ is defined as a tuple $\langle \mathcal{S}, \mathcal{O}, \theta \rangle$. $\mathcal{S}$ is a finite set of states, which
generate symbols from a finite observation vocabulary $\mathcal{O}$. Following convention, we assume that
variables $q$, $r$, and $s$ refer to states in $\mathcal{S}$, and $o$ refers to symbols in the observation vocabulary $\mathcal{O}$.
This model is parameterized by the tuple $\theta = \langle A, B, \pi \rangle$ consisting of an $|\mathcal{S}| \times |\mathcal{S}|$ matrix $A$ of
transition probabilities, where $A_q(r)$ gives the probability of transitioning from state $q$ to state $r$; an
$|\mathcal{S}| \times |\mathcal{O}|$ matrix $B$ of emission probabilities, where $B_q(o)$ gives the probability that symbol $o$ will be
emitted from state $q$; and an $|\mathcal{S}|$-dimensional vector $\pi$, where $\pi_q$ is the probability that the process
starts in state $q$.⁷ These matrices may be dense, but for many applications sparse parameterizations

⁷ This is only one possible definition of an HMM, but it is one that is useful for many text processing problems. In alternative
definitions, initial and final states may be handled differently, observations may be emitted during the transition between states,
or continuous-valued observations may be emitted (for example, from a Gaussian distribution).


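are useful. We further stipulate that $A_q(r) \ge 0$, $B_q(o) \ge 0$, and $\pi_q \ge 0$ for all $q$, $r$, and $o$, as well as
that:

$$\sum_{r \in \mathcal{S}} A_q(r) = 1 \;\; \forall q \qquad \sum_{o \in \mathcal{O}} B_q(o) = 1 \;\; \forall q \qquad \sum_{q \in \mathcal{S}} \pi_q = 1$$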

A sequence of observations of length $\tau$ is generated as follows:

Step 0: let $t = 1$ and select an initial state $q$ according to the distribution $\pi$.

Step 1: an observation symbol from $\mathcal{O}$ is emitted according to the distribution $B_q$.

Step 2: a new $q$ is drawn according to the distribution $A_q$.

Step 3: $t$ is incremented, and if $t \le \tau$, the process repeats from Step 1.

Since all events generated by this process are conditionally independent, the joint probability of this

sequence of observations and the state sequence used to generate it is the product of the individual

event probabilities.
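The generative process in Steps 0 through 3 translates directly into a sampling loop. The following is a minimal sketch, assuming the distributions are stored as nested dictionaries; the two-state model is invented for illustration (it is not the model of Figure 6.3), and `sample_hmm` is a hypothetical name.

```python
import random


def sample_hmm(pi, A, B, length, rng):
    """Generate (states, observations) of the given length from an HMM.

    pi[q], A[q][r], and B[q][o] are dictionaries of probabilities.
    """
    def draw(dist):
        keys = list(dist)
        return rng.choices(keys, weights=[dist[k] for k in keys])[0]

    states, observations = [], []
    q = draw(pi)                            # Step 0: initial state from pi
    for _ in range(length):                 # Step 3: repeat until t > length
        states.append(q)
        observations.append(draw(B[q]))     # Step 1: emit a symbol from B_q
        q = draw(A[q])                      # Step 2: next state from A_q
    return states, observations


# A made-up two-state model with a two-symbol vocabulary.
pi = {"s": 0.6, "t": 0.4}
A = {"s": {"s": 0.7, "t": 0.3}, "t": {"s": 0.4, "t": 0.6}}
B = {"s": {"x": 0.9, "y": 0.1}, "t": {"x": 0.2, "y": 0.8}}
states, observations = sample_hmm(pi, A, B, 5, random.Random(0))
print(states, observations)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; the final transition drawn after the last emission is simply discarded.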

Figure 6.3 shows a simple example of a hidden Markov model for part-of-speech tagging,

which is the task of assigning to each word in an input sentence its grammatical category (one of

the first steps in analyzing textual content). States $\mathcal{S} = \{\text{det}, \text{adj}, \text{nn}, \text{v}\}$ correspond to the parts
of speech (determiner, adjective, noun, and verb), and observations $\mathcal{O} = \{\text{the}, \text{a}, \text{green}, \ldots\}$ are a
subset of English words. This example illustrates a key intuition behind many applications of HMMs:
states correspond to equivalence classes or clustering of observations, and a single observation type
may be associated with several clusters (in this example, the word wash can be generated by an nn or
v, since wash can either be a noun or a verb).

6.2.1 THREE QUESTIONS FOR HIDDEN MARKOV MODELS

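There are three fundamental questions associated with hidden Markov models:⁸

1. Given a model $\mathcal{M} = \langle \mathcal{S}, \mathcal{O}, \theta \rangle$, and an observation sequence of symbols from $\mathcal{O}$, $\mathbf{x} = \langle x_1, x_2, \ldots, x_\tau \rangle$,
what is the probability that $\mathcal{M}$ generated the data (summing over all possible
state sequences, $\mathcal{Y}$)?

$$\Pr(\mathbf{x}) = \sum_{\mathbf{y} \in \mathcal{Y}} \Pr(\mathbf{x}, \mathbf{y}; \theta)$$

2. Given a model $\mathcal{M} = \langle \mathcal{S}, \mathcal{O}, \theta \rangle$ and an observation sequence $\mathbf{x}$, what is the most likely sequence
of states that generated the data?

$$\mathbf{y}^* = \arg\max_{\mathbf{y} \in \mathcal{Y}} \Pr(\mathbf{x}, \mathbf{y}; \theta)$$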

⁸ The organization of this section is based in part on ideas from Lawrence Rabiner's HMM tutorial [125].


[Figure 6.3 (image not reproduced): a state-transition diagram over the states DET, ADJ, NN, and V, with tables of transition probabilities, emission probabilities, and initial probabilities. Example outputs, with the generating state sequence beneath each:

John might wash
NN V V

the big green person loves old plants
DET ADJ ADJ NN V ADJ NN

plants washes books books books
NN V V NN V]

Figure 6.3: An example HMM that relates part-of-speech tags to vocabulary items in an English-like

language. Possible (probability > 0) transitions for the Markov process are shown graphically. In the

example outputs, the state sequences corresponding to the emissions are written beneath the emitted

symbols.


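3. Given a set of states $\mathcal{S}$, an observation vocabulary $\mathcal{O}$, and a series of $\ell$ i.i.d. observation sequences
$\langle \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_\ell \rangle$, what are the parameters $\theta = \langle A, B, \pi \rangle$ that maximize the likelihood
of the training data?

$$\theta^* = \arg\max_{\theta} \prod_{i=1}^{\ell} \sum_{\mathbf{y} \in \mathcal{Y}} \Pr(\mathbf{x}_i, \mathbf{y}; \theta)$$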

Using our deﬁnition of an HMM, the answers to the ﬁrst two questions are in principle quite

trivial to compute: by iterating over all state sequences Y, the probability that each generated x can

be computed by looking up and multiplying the relevant probabilities in A, B, and π, and then

summing the result or taking the maximum. And, as we hinted at in the previous section, the third

question can be answered using EM. Unfortunately, even with all the distributed computing power

MapReduce makes available, we will quickly run into trouble if we try to use this naïve strategy since
there are $|\mathcal{S}|^\tau$ distinct state sequences of length $\tau$, making exhaustive enumeration computationally
intractable. Fortunately, because the underlying model behaves exactly the same whenever it is in

some state, regardless of how it got to that state, we can use dynamic programming algorithms to

answer all of the above questions without summing over exponentially many sequences.

6.2.2 THE FORWARD ALGORITHM

Given some observation sequence, for example $\mathbf{x} = \langle \text{John}, \text{might}, \text{wash} \rangle$, Question 1 asks what is
the probability that this sequence was generated by an HMM $\mathcal{M} = \langle \mathcal{S}, \mathcal{O}, \theta \rangle$. For the purposes of
illustration, we assume that $\mathcal{M}$ is defined as shown in Figure 6.3.

There are two ways to compute the probability of $\mathbf{x}$ having been generated by $\mathcal{M}$. The first
is to compute the sum over the joint probability of $\mathbf{x}$ and every possible labeling
$\mathbf{y}' \in \{\langle \text{det}, \text{det}, \text{det} \rangle, \langle \text{det}, \text{det}, \text{nn} \rangle, \langle \text{det}, \text{det}, \text{v} \rangle, \ldots\}$. As indicated above, this is not feasible for most
sequences, since the number of possible labelings is exponential in the length of $\mathbf{x}$. The second, fortunately,
is much more efficient.

We can make use of what is known as the forward algorithm to compute the desired probability
in polynomial time. We assume a model $\mathcal{M} = \langle \mathcal{S}, \mathcal{O}, \theta \rangle$ as defined above. This algorithm works by
recursively computing the answer to a related question: what is the probability that the process is in
state $q$ at time $t$ and has generated $\langle x_1, x_2, \ldots, x_t \rangle$? Call this probability $\alpha_t(q)$. Thus, $\alpha_t(q)$ is a two
dimensional matrix (of size $|\mathbf{x}| \times |\mathcal{S}|$), called a trellis. It is easy to see that the values of $\alpha_1(q)$ can be
computed as the product of two independent probabilities: the probability of starting in state $q$ and
the probability of state $q$ generating $x_1$:

$$\alpha_1(q) = \pi_q \cdot B_q(x_1)$$

From this, it's not hard to see that the values of $\alpha_2(r)$ for every $r$ can be computed in terms of the
$|\mathcal{S}|$ values in $\alpha_1(\cdot)$ and the observation $x_2$:

$$\alpha_2(r) = B_r(x_2) \cdot \sum_{q \in \mathcal{S}} \alpha_1(q) \cdot A_q(r)$$

This works because there are $|\mathcal{S}|$ different ways to get to state $r$ at time $t = 2$: starting from state
$1, 2, \ldots, |\mathcal{S}|$ and transitioning to state $r$. Furthermore, because the behavior of a Markov process is
determined only by the state it is in at some time (not by how it got to that state), $\alpha_t(r)$ can always
be computed in terms of the $|\mathcal{S}|$ values in $\alpha_{t-1}(\cdot)$ and the observation $x_t$:

$$\alpha_t(r) = B_r(x_t) \cdot \sum_{q \in \mathcal{S}} \alpha_{t-1}(q) \cdot A_q(r)$$

We have now shown how to compute the probability of being in any state $q$ at any time $t$, having
generated $\langle x_1, x_2, \ldots, x_t \rangle$, with the forward algorithm. The probability of the full sequence is the
probability of being in any state at time $|\mathbf{x}|$, so the answer to Question 1 can be computed simply
by summing over $\alpha$ values at time $|\mathbf{x}|$ for all states:

$$\Pr(\mathbf{x}; \theta) = \sum_{q \in \mathcal{S}} \alpha_{|\mathbf{x}|}(q)$$

In summary, there are two ways of computing the probability that a sequence of observations x

was generated by M: exhaustive enumeration with summing and the forward algorithm. Figure 6.4

illustrates the two possibilities. The upper panel shows the naïve exhaustive approach, enumerating
all $4^3$ possible labels $\mathbf{y}'$ of $\mathbf{x}$ and computing their joint probability $\Pr(\mathbf{x}, \mathbf{y}')$. Summing over all $\mathbf{y}'$,
the marginal probability of $\mathbf{x}$ is found to be 0.00018. The lower panel shows the forward trellis,
consisting of $4 \times 3$ cells. Summing over the final column also yields 0.00018, the same result.
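Both routes to the sequence probability can be sketched in code. The parameter values below are assumptions: they were chosen to be consistent with the trellis values reported for Figure 6.4, only the emissions needed for the three example words are listed (missing entries are treated as zero), and the split of the product $\pi_{\text{NN}} \cdot B_{\text{NN}}(\text{John}) = 0.03$ into two factors is arbitrary. The names `forward` and `brute_force` are hypothetical.

```python
from itertools import product

STATES = ["DET", "ADJ", "NN", "V"]
# Assumed parameters, consistent with the trellis values in Figure 6.4.
PI = {"DET": 0.5, "ADJ": 0.1, "NN": 0.3, "V": 0.1}
A = {  # A[q][r]: probability of moving from state q to state r
    "DET": {"DET": 0.0, "ADJ": 0.3, "NN": 0.7, "V": 0.0},
    "ADJ": {"DET": 0.0, "ADJ": 0.2, "NN": 0.7, "V": 0.1},
    "NN": {"DET": 0.0, "ADJ": 0.1, "NN": 0.4, "V": 0.5},
    "V": {"DET": 0.5, "ADJ": 0.2, "NN": 0.2, "V": 0.1},
}
B = {  # B[q][o]: only the emissions needed for the example sentence
    "DET": {},
    "ADJ": {"might": 0.1},
    "NN": {"John": 0.1, "wash": 0.1},
    "V": {"might": 0.2, "wash": 0.3},
}


def forward(x, states, pi, A, B):
    """Return the trellis alpha[t][q] (t is 0-based) and Pr(x)."""
    alpha = [{q: pi[q] * B[q].get(x[0], 0.0) for q in states}]
    for t in range(1, len(x)):
        prev = alpha[-1]
        alpha.append({
            r: B[r].get(x[t], 0.0) * sum(prev[q] * A[q][r] for q in states)
            for r in states
        })
    return alpha, sum(alpha[-1].values())


def brute_force(x, states, pi, A, B):
    """Sum Pr(x, y) over every state sequence y (exponential in len(x))."""
    total = 0.0
    for y in product(states, repeat=len(x)):
        p = pi[y[0]] * B[y[0]].get(x[0], 0.0)
        for t in range(1, len(x)):
            p *= A[y[t - 1]][y[t]] * B[y[t]].get(x[t], 0.0)
        total += p
    return total


x = ["John", "might", "wash"]
alpha, prob = forward(x, STATES, PI, A, B)
print(prob, brute_force(x, STATES, PI, A, B))
```

Under these assumed parameters both functions arrive at approximately 0.00018, but the trellis needs on the order of $|\mathbf{x}| \cdot |\mathcal{S}|^2$ multiplications while the enumeration visits all $|\mathcal{S}|^{|\mathbf{x}|}$ labelings.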

6.2.3 THE VITERBI ALGORITHM

Given an observation sequence $\mathbf{x}$, the second question we might want to ask of $\mathcal{M}$ is: what is the most
likely sequence of states that generated the observations? As with the previous question, the naïve
approach to solving this problem is to enumerate all possible labels and find the one with the highest
joint probability. Continuing with the example observation sequence $\mathbf{x} = \langle \text{John}, \text{might}, \text{wash} \rangle$,
examining the chart of probabilities in the upper panel of Figure 6.4 shows that $\mathbf{y}^* = \langle \text{nn}, \text{v}, \text{v} \rangle$ is
the most likely sequence of states under our example HMM.

However, a more efficient answer to Question 2 can be computed using the same intuition
as in the forward algorithm: determine the best state sequence for a short sequence and extend this to
easily compute the best sequence for longer ones. This is known as the Viterbi algorithm. We define
$\gamma_t(q)$, the Viterbi probability, to be the probability of the most probable sequence of states ending in state $q$ at time $t$
and generating observations $\langle x_1, x_2, \ldots, x_t \rangle$. Since we wish to be able to reconstruct the sequence


[Figure 6.4 (image not reproduced). Upper panel: a table of p(x, y) for all 64 state sequences labeling John might wash. The only non-zero entries are NN ADJ NN (0.000021), NN ADJ V (0.000009), NN V NN (0.00006), and NN V V (0.00009), so that $\Pr(\mathbf{x}) = \sum_{\mathbf{y} \in \mathcal{Y}} \Pr(\mathbf{x}, \mathbf{y}; \theta) = 0.00018$. Lower panel: the forward trellis with columns $\alpha_1$, $\alpha_2$, $\alpha_3$ and rows DET, ADJ, NN, V. Its non-zero cells are $\alpha_1(\text{NN}) = 0.03$, $\alpha_2(\text{ADJ}) = 0.0003$, $\alpha_2(\text{V}) = 0.003$, $\alpha_3(\text{NN}) = 0.000081$, and $\alpha_3(\text{V}) = 0.000099$, so that $\Pr(\mathbf{x}) = \sum_{q \in \mathcal{S}} \alpha_3(q) = 0.00018$.]

Figure 6.4: Computing the probability of the sequence John, might, wash under the HMM given in

Figure 6.3 by explicitly summing over all possible sequence labels (upper panel) and using the forward

algorithm (lower panel).


of states, we define $bp_t(q)$, the "backpointer", to be the state used in this sequence at time $t - 1$. The
base case for the recursion is as follows (the state index of −1 is used as a placeholder since there is
no previous best state at time $t = 1$):

$$\gamma_1(q) = \pi_q \cdot B_q(x_1)$$

$$bp_1(q) = -1$$

The recursion is similar to that of the forward algorithm, except rather than summing over previous

states, the maximum value of all possible trajectories into state r at time t is computed. Note that

the backpointer simply records the index of the originating state—a separate computation is not

necessary.

$$\gamma_t(r) = \max_{q \in \mathcal{S}} \gamma_{t-1}(q) \cdot A_q(r) \cdot B_r(x_t)$$

$$bp_t(r) = \arg\max_{q \in \mathcal{S}} \gamma_{t-1}(q) \cdot A_q(r) \cdot B_r(x_t)$$

To compute the best sequence of states, $\mathbf{y}^*$, the state with the highest probability path at time $|\mathbf{x}|$ is
selected, and then the backpointers are followed, recursively, to construct the rest of the sequence:

$$y^*_{|\mathbf{x}|} = \arg\max_{q \in \mathcal{S}} \gamma_{|\mathbf{x}|}(q)$$

$$y^*_{t-1} = bp_t(y^*_t)$$

Figure 6.5 illustrates a Viterbi trellis, including backpointers that have been used to compute the

most likely state sequence.
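The recursion and the backpointer walk can be sketched as follows. As before, the parameter values are assumptions chosen to be consistent with the trellis values reported for the example sentence, with unlisted emissions treated as zero; `None` stands in for the −1 placeholder used in the base case, and `viterbi` is a hypothetical function name.

```python
STATES = ["DET", "ADJ", "NN", "V"]
# Assumed parameters, consistent with the trellis values for the example.
PI = {"DET": 0.5, "ADJ": 0.1, "NN": 0.3, "V": 0.1}
A = {  # A[q][r]: probability of moving from state q to state r
    "DET": {"DET": 0.0, "ADJ": 0.3, "NN": 0.7, "V": 0.0},
    "ADJ": {"DET": 0.0, "ADJ": 0.2, "NN": 0.7, "V": 0.1},
    "NN": {"DET": 0.0, "ADJ": 0.1, "NN": 0.4, "V": 0.5},
    "V": {"DET": 0.5, "ADJ": 0.2, "NN": 0.2, "V": 0.1},
}
B = {  # B[q][o]: only the emissions needed for the example sentence
    "DET": {},
    "ADJ": {"might": 0.1},
    "NN": {"John": 0.1, "wash": 0.1},
    "V": {"might": 0.2, "wash": 0.3},
}


def viterbi(x, states, pi, A, B):
    """Return the most likely state sequence for x and its joint probability."""
    gamma = [{q: pi[q] * B[q].get(x[0], 0.0) for q in states}]
    bp = [{q: None for q in states}]        # no predecessor at the first step
    for t in range(1, len(x)):
        prev = gamma[-1]
        g, b = {}, {}
        for r in states:
            # Best predecessor of r; B_r(x_t) is constant across q.
            best = max(states, key=lambda q: prev[q] * A[q][r])
            g[r] = prev[best] * A[best][r] * B[r].get(x[t], 0.0)
            b[r] = best
        gamma.append(g)
        bp.append(b)
    # Pick the best final state, then follow backpointers to the start.
    last = max(states, key=lambda q: gamma[-1][q])
    path = [last]
    for t in range(len(x) - 1, 0, -1):
        path.append(bp[t][path[-1]])
    path.reverse()
    return path, gamma[-1][last]


path, best_prob = viterbi(["John", "might", "wash"], STATES, PI, A, B)
print(path, best_prob)
```

The only structural difference from the forward recursion is that the sum over predecessor states is replaced by a maximum, with the maximizing state stored so the sequence can be rebuilt afterward.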

6.2.4 PARAMETER ESTIMATION FOR HMMS

We now turn to Question 3: given a set of states $\mathcal{S}$ and observation vocabulary $\mathcal{O}$, what are the parameters
$\theta^* = \langle A, B, \pi \rangle$ that maximize the likelihood of a set of training examples, $\langle \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_\ell \rangle$?⁹

Since our model is constructed in terms of variables whose values we cannot observe (the state

sequence) in the training data, we may train it to optimize the marginal likelihood (summing over

all state sequences) of x using EM. Deriving the EM update equations requires only the application

of the techniques presented earlier in this chapter and some differential calculus. However, since

the formalism is cumbersome, we will skip a detailed derivation, but readers interested in more

information can ﬁnd it in the relevant citations [78; 125].

⁹ Since an HMM models sequences, training data consist of a collection of example sequences.

Figure 6.5: Computing the most likely state sequence that generated ⟨John, might, wash⟩ under the HMM given in Figure 6.3 using the Viterbi algorithm. The most likely state sequence is highlighted in bold and could be recovered programmatically by following backpointers from the maximal probability cell in the last column to the first column (thicker arrows). [Trellis graphic omitted: rows DET, ADJ, V, NN; columns $\gamma_1$, $\gamma_2$, $\gamma_3$ for John, might, wash.]

Figure 6.6: A “fully observable” HMM training instance. The output sequence is at the top of the figure, and the corresponding states and transitions are shown in the trellis below. [Trellis graphic omitted: rows DET, ADJ, V, NN; columns John, might, wash.]

In order to make the update equations as intuitive as possible, consider a fully observable

HMM, that is, one where both the emissions and the state sequence are observable in all training

instances. In this case, a training instance can be depicted as shown in Figure 6.6. When this is the

case, such as when we have a corpus of sentences in which all words have already been tagged with

their parts of speech, the maximum likelihood estimate for the parameters can be computed in terms

of the counts of the number of times the process transitions from state q to state r in all training

instances, T(q → r); the number of times that state q emits symbol o, O(q ↑ o); and the number

of times the process starts in state q, I(q). In this example, the process starts in state nn; there is

one nn → v transition and one v → v transition. The nn state emits John in the first time step,

and the v state emits might and wash in the second and third time steps, respectively. We also define

N(q) to be the number of times the process enters state q. The maximum likelihood estimates of

the parameters in the fully observable case are:


$$\pi_q = \frac{I(q)}{\sum_{r} I(r)} \qquad A_q(r) = \frac{T(q \to r)}{N(q)},\; N(q) = \sum_{r'} T(q \to r') \qquad B_q(o) = \frac{O(q \uparrow o)}{N(q)},\; N(q) = \sum_{o'} O(q \uparrow o') \tag{6.2}$$

For example, to compute the emission parameters from state nn, we simply need to keep track of

the number of times the process is in state nn and what symbol it generated at each of these times.

Transition probabilities are computed similarly: to compute, for example, the distribution $A_{\text{det}}(\cdot)$,

that is, the probabilities of transitioning away from state det, we count the number of times the

process is in state det, and keep track of what state the process transitioned into at the next time

step. This counting and normalizing can be accomplished using the exact same counting and relative

frequency algorithms that we described in Section 3.3. Thus, in the fully observable case, parameter

estimation is not a new algorithm at all, but one we have seen before.
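The counting and normalizing of Equation 6.2 can be sketched directly in Python. The function names and the input layout (each training instance as a non-empty list of (symbol, state) pairs) are assumptions made for this illustration; the single training instance is the John might wash example of Figure 6.6.

```python
from collections import Counter, defaultdict


def normalize(counts):
    """Turn a table of counts into relative frequencies."""
    total = sum(counts.values())
    return {event: count / total for event, count in counts.items()}


def estimate_observable_hmm(tagged_sequences):
    """Maximum likelihood estimates from fully observed (symbol, state) sequences."""
    initial = Counter()                  # I(q)
    transitions = defaultdict(Counter)   # T(q -> r)
    emissions = defaultdict(Counter)     # O(q ^ o)
    for seq in tagged_sequences:
        initial[seq[0][1]] += 1
        for symbol, state in seq:
            emissions[state][symbol] += 1
        for (_, q), (_, r) in zip(seq, seq[1:]):
            transitions[q][r] += 1
    pi = normalize(initial)
    A = {q: normalize(counts) for q, counts in transitions.items()}
    B = {q: normalize(counts) for q, counts in emissions.items()}
    return pi, A, B


# The training instance of Figure 6.6: John/nn might/v wash/v.
corpus = [[("John", "nn"), ("might", "v"), ("wash", "v")]]
pi, A, B = estimate_observable_hmm(corpus)
print(pi, A, B)
```

With only this one instance, every estimate is degenerate: the process always starts in nn, nn always moves to v, v always stays in v, and v emits might and wash with probability 0.5 each. States that never occur receive no parameters at all, which is one reason real systems train on large tagged corpora.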

How should the model parameters be estimated when the state sequence is not provided? It

turns out that the update equations have the satisfying form where the optimal parameter values

for iteration i +1 are expressed in terms of the expectations of the counts referenced in the fully

observed case, according to the posterior distribution over the latent variables given the observations

x and the parameters $\theta^{(i)}$:

$$\pi_q = E[I(q)] \qquad A_q(r) = \frac{E[T(q \to r)]}{E[N(q)]} \qquad B_q(o) = \frac{E[O(q \uparrow o)]}{E[N(q)]} \tag{6.3}$$

Because of the independence assumptions made in the HMM, the update equations consist of

$2 \cdot |S| + 1$ independent optimization problems, just as was the case with the ‘observable’ HMM.

Solving for the initial state distribution, π, is one problem; there are |S| solving for the transition

distributions $A_q(\cdot)$ from each state q; and |S| solving for the emissions distributions $B_q(\cdot)$ from

each state q. Furthermore, we note that the following must hold:

$$E[N(q)] = \sum_{r \in S} E[T(q \to r)] = \sum_{o \in O} E[O(q \uparrow o)]$$

As a result, the optimization problems (i.e., Equations 6.2) require completely independent sets of

statistics, which we will utilize later to facilitate efﬁcient parallelization in MapReduce.

How can the expectations in Equation 6.3 be understood? In the fully observed training case,

between every time step, there is exactly one transition taken and the source and destination states

are observable. By progressing through the Markov chain, we can let each transition count as ‘1’, and

we can accumulate the total number of times each kind of transition was taken (by each kind, we

simply mean the number of times that one state follows another, for example, the number of times

nn follows det). These statistics can then in turn be used to compute the MLE for an ‘observable’

HMM, as described above. However, when the transition sequence is not observable (as is most

often the case), we can instead imagine that at each time step, every possible transition (there are

$|S|^2$ of them, and typically |S| is quite small) is taken, with a particular probability. The probability

used is the posterior probability of the transition, given the model and an observation sequence (we


describe how to compute this value below). By summing over all the time steps in the training data,

and using this probability as the ‘count’ (rather than ‘1’ as in the observable case), we compute the

expected count of the number of times a particular transition was taken, given the training sequence.

Furthermore, since the training instances are statistically independent, the value of the expectations

can be computed by processing each training instance independently and summing the results.

Similarly for the necessary emission counts (the number of times each symbol in O was

generated by each state in S), we assume that any state could have generated the observation. We

must therefore compute the probability of being in every state at each time point, which is then the

size of the emission ‘count’. By summing over all time steps we compute the expected count of the

number of times that a particular state generated a particular symbol. These two sets of expectations,

which are written formally here, are sufﬁcient to execute the M-step.

$$E[O(q \uparrow o)] = \sum_{i=1}^{|x|} \Pr(y_i = q \mid x; \theta) \cdot \delta(x_i, o) \tag{6.4}$$

$$E[T(q \to r)] = \sum_{i=1}^{|x|-1} \Pr(y_i = q, y_{i+1} = r \mid x; \theta) \tag{6.5}$$

Posterior probabilities. The expectations necessary for computing the M-step in HMM training

are sums of probabilities that a particular transition is taken, given an observation sequence, and

that some state emits some observation symbol, given an observation sequence. These are referred to

as posterior probabilities, indicating that they are the probability of some event whose distribution

we have a prior belief about, after additional evidence has been taken into consideration (here, the

model parameters characterize our prior beliefs, and the observation sequence is the evidence).

Both posterior probabilities can be computed by combining the forward probabilities, $\alpha_t(\cdot)$, which

give the probability of reaching some state at time t, by any path, and generating the observations

$\langle x_1, x_2, \ldots, x_t \rangle$, with backward probabilities, $\beta_t(\cdot)$, which give the probability of starting in some

state at time t and generating the rest of the sequence $\langle x_{t+1}, x_{t+2}, \ldots, x_{|x|} \rangle$, using any sequence of

states to do so. The algorithm for computing the backward probabilities is given a bit later. Once the

forward and backward probabilities have been computed, the state transition posterior probabilities

and the emission posterior probabilities can be written as follows:

$$\Pr(y_i = q \mid x; \theta) = \alpha_i(q) \cdot \beta_i(q) \tag{6.6}$$

$$\Pr(y_i = q, y_{i+1} = r \mid x; \theta) = \alpha_i(q) \cdot A_q(r) \cdot B_r(x_{i+1}) \cdot \beta_{i+1}(r) \tag{6.7}$$

Equation 6.6 is the probability of being in state q at time i, given x, and the correctness of the

expression should be clear from the definitions of forward and backward probabilities. The intuition

for Equation 6.7, the probability of taking a particular transition at a particular time, is also not

complicated. It is the product of four conditionally independent probabilities: the probability of

getting to state q at time i (having generated the first part of the sequence), the probability of taking

transition q → r (which is specified in the parameters, θ), the probability of generating observation

$x_{i+1}$ from state r (also specified in θ), and the probability of generating the rest of the sequence,

along any path. A visualization of the quantities used in computing this probability is shown in

Figure 6.7. In this illustration, we assume an HMM with $S = \{s_1, s_2, s_3\}$ and $O = \{a, b, c\}$.

Figure 6.7: Using forward and backward probabilities to compute the posterior probability of the dashed transition, given the observation sequence a b b c b. The shaded area on the left corresponds to the forward probability $\alpha_2(s_2)$, and the shaded area on the right corresponds to the backward probability $\beta_3(s_2)$. [Trellis graphic omitted: rows $s_1$, $s_2$, $s_3$.]

The backward algorithm. Like the forward and Viterbi algorithms introduced above to answer

Questions 1 and 2, the backward algorithm uses dynamic programming to incrementally compute

$\beta_t(\cdot)$. Its base case starts at time |x|, and is defined as follows:

$$\beta_{|x|}(q) = 1$$

To understand the intuition of the base case, keep in mind that since the backward probabilities

$\beta_t(\cdot)$ are the probability of generating the remainder of the sequence after time t (as well as being

in some state), and since there is nothing left to generate after time |x|, the probability must be 1.

The recursion is defined as follows:

$$\beta_t(q) = \sum_{r \in S} \beta_{t+1}(r) \cdot A_q(r) \cdot B_r(x_{t+1})$$


Unlike the forward and Viterbi algorithms, the backward algorithm is computed from right to left

and makes no reference to the start probabilities, π.
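To make the forward and backward recursions and the posteriors of Equations 6.6 and 6.7 concrete, here is a small Python sketch. The function names, the dictionary-based parameter layout, and the two-state HMM are assumptions for illustration only. One detail is worth flagging: the products on the right-hand sides of Equations 6.6 and 6.7 are joint probabilities with the observations, so the sketch divides them by Pr(x) = Σ_q α_|x|(q) to obtain values that sum to one at each time step.

```python
def forward(x, states, pi, A, B):
    """alpha[t][q]: probability of generating x[0..t] and being in q at time t."""
    alpha = [{q: pi[q] * B[q].get(x[0], 0.0) for q in states}]
    for t in range(1, len(x)):
        alpha.append({
            r: B[r].get(x[t], 0.0) * sum(alpha[t - 1][q] * A[q][r] for q in states)
            for r in states
        })
    return alpha


def backward(x, states, A, B):
    """beta[t][q]: probability of generating x[t+1..] when starting in q at time t."""
    beta = [None] * len(x)
    beta[-1] = {q: 1.0 for q in states}      # base case: nothing left to generate
    for t in range(len(x) - 2, -1, -1):      # computed from right to left
        beta[t] = {
            q: sum(beta[t + 1][r] * A[q][r] * B[r].get(x[t + 1], 0.0) for r in states)
            for q in states
        }
    return beta


def posteriors(x, states, pi, A, B):
    """Return (state posteriors, transition posteriors, Pr(x))."""
    alpha = forward(x, states, pi, A, B)
    beta = backward(x, states, A, B)
    px = sum(alpha[-1][q] for q in states)
    state_post = [{q: alpha[t][q] * beta[t][q] / px for q in states}
                  for t in range(len(x))]
    trans_post = [{(q, r): alpha[t][q] * A[q][r] * B[r].get(x[t + 1], 0.0)
                   * beta[t + 1][r] / px
                   for q in states for r in states}
                  for t in range(len(x) - 1)]
    return state_post, trans_post, px


# Hypothetical two-state HMM used only for this example.
states = ["s1", "s2"]
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}

state_post, trans_post, px = posteriors(["a", "b", "b"], states, pi, A, B)
print(px)
print(state_post[0])
```

Two sanity checks follow from the definitions: Σ_q α_t(q) · β_t(q) equals Pr(x) at every time step, and summing the transition posterior at time t over the destination state r recovers the state posterior of q at time t.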

6.2.5 FORWARD-BACKWARD TRAINING: SUMMARY

In the preceding section, we showed how to compute all quantities needed to find the parameter

settings $\theta^{(i+1)}$ using EM training with a hidden Markov model $M = \langle S, O, \theta^{(i)} \rangle$. To recap: each

training instance x is processed independently, using the parameter settings of the current iteration,

$\theta^{(i)}$. For each x in the training data, the forward and backward probabilities are computed using the

algorithms given above (for this reason, this training algorithm is often referred to as the forward-

backward algorithm). The forward and backward probabilities are in turn used to compute the

expected number of times the underlying Markov process enters into each state, the number of

times each state generates each output symbol type, and the number of times each state transitions

into each other state. These expectations are summed over all training instances, completing the

E-step. The M-step involves normalizing the expected counts computed in the E-step using the

calculations in Equation 6.3, which yields $\theta^{(i+1)}$. The process then repeats from the E-step using

the new parameters. The number of iterations required for convergence depends on the quality of

the initial parameters, and the complexity of the model. For some applications, only a handful of

iterations are necessary, whereas for others, hundreds may be required.

Finally, a few practical considerations: HMMs have a non-convex likelihood surface (meaning

that it has the equivalent of many hills and valleys in the number of dimensions corresponding to

the number of parameters in the model). EM training is only guaranteed to ﬁnd a local maximum,

and the quality of the learned model may vary considerably, depending on the initial parameters

that are used. Strategies for optimal selection of initial parameters depend on the phenomena being

modeled. Additionally, if some parameter is assigned a probability of 0 (either as an initial value or

during one of the M-step parameter updates), EM will never change this in future iterations. This

can be useful, since it provides a way of constraining the structures of the Markov model; however,

one must be aware of this behavior.

Another pitfall to avoid when implementing HMMs is arithmetic underflow. HMMs typically

define a massive number of sequences, and so the probability of any one of them is often vanishingly

small, so small that it often underflows standard floating point representations. A very common

solution to this problem is to represent probabilities using their logarithms. Note that expected counts

do not typically have this problem and can be represented using normal floating point numbers. See

Section 5.4 for additional discussion on working with log probabilities.
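As an illustration of the log-space remedy, the following Python sketch runs the forward algorithm entirely on log probabilities, replacing sums with a log-sum-exp. The helper names `logsumexp`, `safe_log`, and `log_forward` and the two-state HMM are assumptions made for this example; zero probabilities are represented as negative infinity.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Compute log(sum(exp(v))) without leaving log space."""
    m = max(values)
    if m == NEG_INF:                 # every term has probability zero
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def safe_log(p):
    return math.log(p) if p > 0.0 else NEG_INF


def log_forward(x, states, pi, A, B):
    """Return log Pr(x), keeping every intermediate value as a log probability."""
    log_alpha = {q: safe_log(pi[q]) + safe_log(B[q].get(x[0], 0.0)) for q in states}
    for symbol in x[1:]:
        log_alpha = {
            r: safe_log(B[r].get(symbol, 0.0))
            + logsumexp([log_alpha[q] + safe_log(A[q][r]) for q in states])
            for r in states
        }
    return logsumexp(list(log_alpha.values()))


# Hypothetical two-state HMM used only for this example.
states = ["s1", "s2"]
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}

long_sequence = ["a", "b"] * 2000    # 4,000 observations
log_px = log_forward(long_sequence, states, pi, A, B)
print(log_px)
```

For this 4,000-symbol sequence the probability is below e^−745, the rough lower limit of double-precision numbers, so a direct product of probabilities would round to zero, while the log-space value stays finite.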

6.3 EM IN MAPREDUCE

Expectation maximization algorithms ﬁt quite naturally into the MapReduce programming model.

Although the model being optimized determines the details of the required computations, MapReduce

implementations of EM algorithms share a number of characteristics:


• Each iteration of EM is one MapReduce job.

• A controlling process (i.e., driver program) spawns the MapReduce jobs and keeps track of

the number of iterations and convergence criteria.

• Model parameters $\theta^{(i)}$, which are static for the duration of the MapReduce job, are loaded by

each mapper from HDFS or other data provider (e.g., a distributed key-value store).

• Mappers map over independent training instances, computing partial latent variable posteriors

(or summary statistics, such as expected counts).

• Reducers sum together the required training statistics and solve one or more of the M-step

optimization problems.

• Combiners, which sum together the training statistics, are often quite effective at reducing the

amount of data that must be shufﬂed across the network.

The degree of parallelization that can be attained depends on the statistical independence assumed

in the model and in the derived quantities required to solve the optimization problems in the M-step.

Since parameters are estimated from a collection of samples that are assumed to be i.i.d., the E-step

can generally be parallelized effectively since every training instance can be processed independently

of the others. In the limit, in fact, each independent training instance could be processed by a separate

mapper!¹⁰

Reducers, however, must aggregate the statistics necessary to solve the optimization problems

as required by the model. The degree to which these may be solved independently depends on the

structure of the model, and this constrains the number of reducers that may be used. Fortunately,

many common models (such as HMMs) require solving several independent optimization problems

in the M-step. In this situation, a number of reducers may be run in parallel. Still, it is possible that in

the worst case, the M-step optimization problem will not decompose into independent subproblems,

making it necessary to use a single reducer.

6.3.1 HMM TRAINING IN MAPREDUCE

As we would expect, the training of hidden Markov models parallelizes well in MapReduce. The

process can be summarized as follows: in each iteration, mappers process training instances, emitting

expected event counts computed using the forward-backward algorithm introduced in Section 6.2.4.

Reducers aggregate the expected counts, completing the E-step, and then generate parameter estimates

for the next iteration using the updates given in Equation 6.3.

¹⁰ Although the wisdom of doing this is questionable, given that the startup costs associated with individual map tasks in Hadoop may be considerable.

This parallelization strategy is effective for several reasons. First, the majority of the computational

effort in HMM training is the running of the forward and backward algorithms. Since

there is no limit on the number of mappers that may be run, the full computational resources of a

cluster may be brought to bear to solve this problem. Second, since the M-step of an HMM training

iteration with |S| states in the model consists of $2 \cdot |S| + 1$ independent optimization problems that

require non-overlapping sets of statistics, this may be exploited with as many as $2 \cdot |S| + 1$ reducers

running in parallel. While the optimization problem is computationally trivial, being able to reduce

in parallel helps avoid the data bottleneck that would limit performance if only a single reducer were

used.

The statistics that are required to solve the M-step optimization problem are quite similar to

the relative frequency estimation example discussed in Section 3.3; however, rather than counts of

observed events, we aggregate expected counts of events. As a result of the similarity, we can employ

the stripes representation for aggregating sets of related values, as described in Section 3.2. A pairs

approach that requires less memory at the cost of slower performance is also feasible.

HMM training mapper. The pseudo-code for the HMM training mapper is given in Figure 6.8.

The input consists of key-value pairs with a unique id as the key and a training instance (e.g., a

sentence) as the value. For each training instance, $2 \cdot |S| + 1$ stripes are emitted with unique keys,

and every training instance emits the same set of keys. Each unique key corresponds to one of the

independent optimization problems that will be solved in the M-step. The outputs are:

1. the probabilities that the unobserved Markov process begins in each state q, with a unique key

designating that the values are initial state counts;

2. the expected number of times that state q generated each emission symbol o (the set of emission

symbols included will be just those found in each training instance x), with a key indicating

that the associated value is a set of emission counts from state q; and

3. the expected number of times state q transitions to each state r, with a key indicating that the

associated value is a set of transition counts from state q.

HMM training reducer. The reducer for one iteration of HMM training, shown together with

an optional combiner in Figure 6.9, aggregates the count collections associated with each key by

summing them. When the values for each key have been completely aggregated, the associative array

contains all of the statistics necessary to compute a subset of the parameters for the next EM iteration.

The optimal parameter settings for the following iteration are computed simply by computing the

relative frequency of each event with respect to its expected count at the current iteration. The newly

computed parameters are emitted from the reducer and written to HDFS. Note that they will be

spread across $2 \cdot |S| + 1$ keys, representing initial state probabilities π, transition probabilities $A_q$

for each state q, and emission probabilities $B_q$ for each state q.


class Mapper
    method Initialize(integer iteration)
        ⟨S, O⟩ ← ReadModel
        θ ← ⟨A, B, π⟩ ← ReadModelParams(iteration)
    method Map(sample id, sequence x)
        α ← Forward(x, θ)                                ▷ cf. Section 6.2.2
        β ← Backward(x, θ)                               ▷ cf. Section 6.2.4
        I ← new AssociativeArray                         ▷ Initial state expectations
        for all q ∈ S do                                 ▷ Loop over states
            I{q} ← α_1(q) · β_1(q)
        O ← new AssociativeArray of AssociativeArray     ▷ Emissions
        for t = 1 to |x| do                              ▷ Loop over observations
            for all q ∈ S do                             ▷ Loop over states
                O{q}{x_t} ← O{q}{x_t} + α_t(q) · β_t(q)
        T ← new AssociativeArray of AssociativeArray     ▷ Transitions
        for t = 1 to |x| − 1 do                          ▷ Loop over observations
            for all q ∈ S do                             ▷ Loop over states
                for all r ∈ S do                         ▷ Loop over states
                    T{q}{r} ← T{q}{r} + α_t(q) · A_q(r) · B_r(x_{t+1}) · β_{t+1}(r)
        Emit(string ‘initial’, stripe I)
        for all q ∈ S do                                 ▷ Loop over states
            Emit(string ‘emit from ’ + q, stripe O{q})
            Emit(string ‘transit from ’ + q, stripe T{q})

Figure 6.8: Mapper pseudo-code for training hidden Markov models using EM. The mappers map over training instances (i.e., sequences of observations $x_i$) and generate the expected counts of initial states, emissions, and transitions taken to generate the sequence.


class Combiner
    method Combine(string t, stripes [C_1, C_2, . . .])
        C_f ← new AssociativeArray
        for all stripe C ∈ stripes [C_1, C_2, . . .] do
            Sum(C_f, C)
        Emit(string t, stripe C_f)

class Reducer
    method Reduce(string t, stripes [C_1, C_2, . . .])
        C_f ← new AssociativeArray
        for all stripe C ∈ stripes [C_1, C_2, . . .] do
            Sum(C_f, C)
        z ← 0
        for all ⟨k, v⟩ ∈ C_f do
            z ← z + v
        P_f ← new AssociativeArray                       ▷ Final parameters vector
        for all ⟨k, v⟩ ∈ C_f do
            P_f{k} ← v/z
        Emit(string t, stripe P_f)

Figure 6.9: Combiner and reducer pseudo-code for training hidden Markov models using EM. The HMMs considered in this book are fully parameterized by multinomial distributions, so reducers do not require special logic to handle different types of model parameters (since they are all of the same type).
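The mapper and reducer of Figures 6.8 and 6.9 can be exercised on a single machine with a short Python simulation. This is a sketch under several assumptions: the function names (`hmm_map`, `hmm_reduce`, `em_iteration`), the in-memory dictionary that stands in for the shuffle and sort phase, and the toy model and data are all invented for illustration, and no Hadoop API is involved. Unlike the pseudo-code, the mapper divides each product by Pr(x), so that sequences with very different probabilities contribute properly normalized expected counts.

```python
import math
from collections import defaultdict


def forward(x, S, pi, A, B):
    alpha = [{q: pi[q] * B[q].get(x[0], 0.0) for q in S}]
    for t in range(1, len(x)):
        alpha.append({r: B[r].get(x[t], 0.0)
                      * sum(alpha[t - 1][q] * A[q].get(r, 0.0) for q in S) for r in S})
    return alpha


def backward(x, S, A, B):
    beta = [{q: 1.0 for q in S}]
    for t in range(len(x) - 2, -1, -1):
        beta.insert(0, {q: sum(beta[0][r] * A[q].get(r, 0.0) * B[r].get(x[t + 1], 0.0)
                               for r in S) for q in S})
    return beta


def hmm_map(x, S, pi, A, B):
    """Mapper: yield (key, stripe) pairs of expected counts for one sequence x."""
    alpha, beta = forward(x, S, pi, A, B), backward(x, S, A, B)
    px = sum(alpha[-1].values())      # Pr(x), turns the products into posteriors
    yield "initial", {q: alpha[0][q] * beta[0][q] / px for q in S}
    for q in S:
        emit, trans = defaultdict(float), defaultdict(float)
        for t, symbol in enumerate(x):
            emit[symbol] += alpha[t][q] * beta[t][q] / px
        for t in range(len(x) - 1):
            for r in S:
                trans[r] += (alpha[t][q] * A[q].get(r, 0.0)
                             * B[r].get(x[t + 1], 0.0) * beta[t + 1][r] / px)
        yield "emit from " + q, dict(emit)
        yield "transit from " + q, dict(trans)


def hmm_reduce(key, stripes):
    """Reducer: sum the stripes for one key, then normalize to relative frequencies."""
    total = defaultdict(float)
    for stripe in stripes:
        for k, v in stripe.items():
            total[k] += v
    z = sum(total.values())
    return key, {k: v / z for k, v in total.items()}


def em_iteration(data, S, pi, A, B):
    """One EM iteration; the dict of lists stands in for shuffle and sort."""
    grouped = defaultdict(list)
    for x in data:
        for key, stripe in hmm_map(x, S, pi, A, B):
            grouped[key].append(stripe)
    params = dict(hmm_reduce(key, stripes) for key, stripes in grouped.items())
    return (params["initial"],
            {q: params["transit from " + q] for q in S},
            {q: params["emit from " + q] for q in S})


def log_likelihood(data, S, pi, A, B):
    return sum(math.log(sum(forward(x, S, pi, A, B)[-1].values())) for x in data)


# Hypothetical model and training data used only for this example.
S = ["s1", "s2"]
pi0 = {"s1": 0.6, "s2": 0.4}
A0 = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B0 = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}
data = [list("abba"), list("bab"), list("aab")]

pi1, A1, B1 = em_iteration(data, S, pi0, A0, B0)
print(log_likelihood(data, S, pi0, A0, B0), log_likelihood(data, S, pi1, A1, B1))
```

Each call to `em_iteration` corresponds to one MapReduce job, and a driver would call it repeatedly until the log-likelihood stops improving. The sketch assumes every sequence has nonzero probability under the current parameters and that at least one sequence has two or more observations; a robust implementation would guard against a zero normalizer in the reducer.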


6.4 CASE STUDY: WORD ALIGNMENT FOR STATISTICAL MACHINE TRANSLATION

To illustrate the real-world beneﬁts of expectation maximization algorithms using MapReduce, we

turn to the problem of word alignment, which is an important task in statistical machine translation

that is typically solved using models whose parameters are learned with EM.

We begin by giving a brief introduction to statistical machine translation and the phrase-based

translation approach; for a more comprehensive introduction, refer to [85; 97]. Fully automated

translation has been studied since the earliest days of electronic computers. After successes with

code-breaking during World War II, there was considerable optimism that translation of human

languages would be another soluble problem. In the early years, work on translation was dominated

by manual attempts to encode linguistic knowledge into computers—another instance of the ‘rule-

based’ approach we described in the introduction to this chapter. These early attempts failed to

live up to the admittedly overly optimistic expectations. For a number of years, the idea of fully

automated translation was viewed with skepticism. Not only was constructing a translation system

labor intensive, but translation pairs had to be developed independently, meaning that improvements

in a Russian-English translation system could not, for the most part, be leveraged to improve a

French-English system.

After languishing for a number of years, the ﬁeld was reinvigorated in the late 1980s when

researchers at IBM pioneered the development of statistical machine translation (SMT), which took

a data-driven approach to solving the problem of machine translation, attempting both to improve

the quality of translation and to reduce the cost of developing systems [29]. The core idea of SMT

is to equip the computer to learn how to translate, using example translations which are produced

for other purposes, and modeling the process as a statistical process with some parameters θ relating

strings in a source language (typically denoted as f) to strings in a target language (typically denoted

as e):

$$e^* = \arg\max_{e} \Pr(e \mid f; \theta)$$

With the statistical approach, translation systems can be developed cheaply and quickly for

any language pair, as long as there is sufﬁcient training data available. Furthermore, improvements

in learning algorithms and statistical modeling can yield beneﬁts in many translation pairs at once,

rather than being speciﬁc to individual language pairs. Thus, SMT, like many other topics we are

considering in this book, is an attempt to leverage the vast quantities of textual data that are available

to solve problems that would otherwise require considerable manual effort to encode specialized

knowledge. Since the advent of statistical approaches to translation, the ﬁeld has grown tremendously

and numerous statistical models of translation have been developed, with many incorporating quite

specialized knowledge about the behavior of natural language as biases in their learning algorithms.


6.4.1 STATISTICAL PHRASE-BASED TRANSLATION

One approach to statistical translation that is simple yet powerful is called phrase-based translation

[86]. We provide a rough outline of the process since it is representative of most state-of-the-art

statistical translation systems, such as the one used inside Google Translate.¹¹ Phrase-based

translation works by learning how strings of words, called phrases, translate between languages.¹²

Example phrase pairs for Spanish-English translation might include ⟨los estudiantes, the students⟩,

⟨los estudiantes, some students⟩, and ⟨soy, i am⟩. From a few hundred thousand sentences of example

translations, many millions of such phrase pairs may be automatically learned.

The starting point is typically a parallel corpus (also called bitext), which contains pairs of

sentences in two languages that are translations of each other. Parallel corpora are frequently generated

as the by-product of an organization’s effort to disseminate information in multiple languages,

for example, proceedings of the Canadian Parliament in French and English, and text generated by

the United Nations in many different languages. The parallel corpus is then annotated with word

alignments, which indicate which words in one language correspond to words in the other. By using

these word alignments as a skeleton, phrases that are likely to preserve the meaning relationships

represented by the word alignment can be extracted from the sentence pair. While an explanation of the

process is not necessary here, we mention it as a motivation for learning word alignments, which we

show below how to compute with EM. After phrase extraction, each phrase pair is associated with

a number of scores which, taken together, are used to compute the phrase translation probability, a

conditional probability that reﬂects how likely the source phrase translates into the target phrase.

We brieﬂy note that although EM could be utilized to learn the phrase translation probabilities, this

is not typically done in practice since the maximum likelihood solution turns out to be quite bad for

this problem. The collection of phrase pairs and their scores are referred to as the translation model.

In addition to the translation model, phrase-based translation depends on a language model, which

gives the probability of a string in the target language. The translation model attempts to preserve

the meaning of the source language during the translation process, while the language model ensures

that the output is fluent and grammatical in the target language. The phrase-based translation

process is summarized in Figure 6.10.

A language model gives the probability that a string of words $w = \langle w_1, w_2, \ldots, w_n \rangle$, written

as $w_1^n$ for short, is a string in the target language. By the chain rule of probability, we get:

$$\Pr(w_1^n) = \Pr(w_1) \Pr(w_2 \mid w_1) \Pr(w_3 \mid w_1^2) \cdots \Pr(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} \Pr(w_k \mid w_1^{k-1}) \tag{6.8}$$

¹¹ http://translate.google.com

¹² Phrases are simply sequences of words; they are not required to correspond to the definition of a phrase in any linguistic theory.

Figure 6.10: The standard phrase-based machine translation architecture. The translation model is constructed with phrases extracted from a word-aligned parallel corpus. The language model is estimated from a monolingual corpus. Both serve as input to the decoder, which performs the actual translation. [Diagram omitted: parallel sentences (e.g., vi la mesa pequeña / i saw the small table) pass through word alignment and phrase extraction, yielding phrase pairs such as (vi, i saw) and (la mesa pequeña, the small table) for the translation model; target-language text (e.g., he sat at the table; the service was good) trains the language model; the decoder maps the foreign input sentence maria no daba una bofetada a la bruja verde to the English output sentence mary did not slap the green witch.]

Figure 6.11: Translation coverage of the sentence Maria no dio una bofetada a la bruja verde by a phrase-based model. The best possible translation path is indicated with a dashed line. [Graphic omitted: a lattice of candidate phrase translations, including Mary, not, did not, no, give, a slap, slap, to the, to, the, the witch, green witch, and witch green.]

Due to the extremely large number of parameters involved in estimating such a model directly, it is

customary to make the Markov assumption, that the sequence histories only depend on prior local

context. That is, an n-gram language model is equivalent to an (n − 1)th-order Markov model. Thus,

we can approximate $P(w_k \mid w_1^{k-1})$ as follows:

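$$\text{bigrams: } P(w_k \mid w_1^{k-1}) \approx P(w_k \mid w_{k-1}) \tag{6.9}$$

$$\text{trigrams: } P(w_k \mid w_1^{k-1}) \approx P(w_k \mid w_{k-1} w_{k-2}) \tag{6.10}$$

$$n\text{-grams: } P(w_k \mid w_1^{k-1}) \approx P(w_k \mid w_{k-n+1}^{k-1}) \tag{6.11}$$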

The probabilities used in computing $\Pr(w_1^n)$ based on an n-gram language model are generally

estimated from a monolingual corpus of target language text. Since only target language text is

necessary (without any additional annotation), language modeling has been well served by large-

data approaches that take advantage of the vast quantities of text available on the web.
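As a concrete instance of Equation 6.9, the following Python sketch estimates an unsmoothed bigram model by relative frequency and scores a sentence with it. The function names and the sentence boundary markers `<s>` and `</s>` are assumptions made for this illustration; the three training sentences are the English examples that appear in Figure 6.10.

```python
from collections import Counter


def train_bigram(sentences):
    """Relative-frequency estimates of P(w_k | w_{k-1}) from tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]   # assumed boundary markers
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return {(h, w): c / histories[h] for (h, w), c in bigrams.items()}


def sentence_prob(sentence, model):
    """Product of bigram probabilities; unseen bigrams get probability zero."""
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for h, w in zip(tokens, tokens[1:]):
        p *= model.get((h, w), 0.0)
    return p


corpus = [
    "he sat at the table".split(),
    "the service was good".split(),
    "i saw the small table".split(),
]
model = train_bigram(corpus)
print(sentence_prob("the table".split(), model))
print(sentence_prob("the good".split(), model))
```

The first sentence scores 1/3 · 1/3 · 1 = 1/9, while the second scores zero because the bigram the good never occurs in the training text. That zero is the sparseness problem in miniature, and it is why practical language models rely on smoothing.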

To translate an input sentence f, the phrase-based decoder creates a matrix of all translation

possibilities of all substrings in the input string, as the example in Figure 6.11 illustrates. A sequence

of phrase pairs is selected such that each word in f is translated exactly once.¹³ The decoder seeks

to find the translation that maximizes the product of the translation probabilities of the phrases

used and the language model probability of the resulting string in the target language. Because the

phrase translation probabilities are independent of each other, and because of the Markov assumption

made in the language model, this may be done efficiently using dynamic programming. For a detailed

introduction to phrase-based decoding, we refer the reader to a recent textbook by Koehn [85].

6.4.2 BRIEF DIGRESSION: LANGUAGE MODELING WITH MAPREDUCE

Statistical machine translation provides the context for a brief digression on distributed parameter

estimation for language models using MapReduce, and provides another example illustrating the

effectiveness of data-driven approaches in general. We brieﬂy touched upon this work in Chapter 1.

Even after making the Markov assumption, training n-gram language models still requires estimating

an enormous number of parameters: potentially $V^n$, where V is the number of words in the vocabulary.

For higher-order models (e.g., 5-grams) used in real-world applications, the number of parameters

can easily exceed the number of words from which to estimate those parameters. In fact, most n-grams

will never be observed in a corpus, no matter how large. To cope with this sparseness, researchers

have developed a number of smoothing techniques [102], which all share the basic idea of moving

probability mass from observed to unseen events in a principled manner. For many applications, a

state-of-the-art approach is known as Kneser-Ney smoothing [35].

In 2007, Brants et al. [25] reported experimental results that answered an interesting question:

given the availability of large corpora (i.e., the web), could a simpler smoothing strategy, applied to

more text, beat Kneser-Ney in a machine translation task? It should come as no surprise that the

answer is yes. Brants et al. introduced a technique known as “stupid backoff” that was exceedingly

simple and so naïve that the resulting model didn’t even define a valid probability distribution (it

assigned arbitrary scores as opposed to probabilities). The simplicity, however, afforded an extremely

scalable implementation in MapReduce. With smaller corpora, stupid backoff didn’t work as well as

Kneser-Ney in generating accurate and fluent translations. However, as the amount of data increased,

the gap between stupid backoff and Kneser-Ney narrowed, and eventually disappeared with sufficient

data. Furthermore, with stupid backoff it was possible to train a language model on more data than

was feasible with Kneser-Ney smoothing. Applying this language model to a machine translation

task yielded better results than a (smaller) language model trained with Kneser-Ney smoothing.

13

The phrases may not necessarily be selected in a strict left-to-right order. Being able to vary the order of the phrases used is

necessary since languages may express the same ideas using different word orders.

The role of the language model in statistical machine translation is to select fluent, grammatical

translations from a large hypothesis space: the more training data a language model has access to, the

better its description of relevant language phenomena and hence its ability to select good translations.

Once again, large data triumphs! For more information about estimating language models using

MapReduce, we refer the reader to a forthcoming book from Morgan & Claypool [26].
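To make the idea of stupid backoff concrete, the following Python sketch scores a word given its context using raw n-gram counts. It is a minimal illustration under stated assumptions, not the distributed implementation of Brants et al.: the function names and the toy corpus are hypothetical, counting is done in memory rather than in a MapReduce job, and the backoff factor of 0.4 is the value reported in their paper. The returned values are scores, not normalized probabilities.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor; 0.4 is the value reported by Brants et al.


def count_ngrams(sentences, max_order):
    """Count every n-gram up to max_order (in-memory stand-in for a counting job)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(word, context, counts, total_tokens, alpha=ALPHA):
    """Score `word` given a tuple of preceding words.

    Uses the relative frequency of the longest observed n-gram, multiplying
    by alpha each time the context is shortened. Scores are not probabilities.
    """
    penalty = 1.0
    while context:
        ngram = context + (word,)
        if counts[ngram] > 0:
            return penalty * counts[ngram] / counts[context]
        context = context[1:]  # drop the earliest word and back off
        penalty *= alpha
    # Base case: unigram relative frequency.
    return penalty * counts[(word,)] / total_tokens


# Hypothetical toy corpus, for illustration only.
corpus = ["the cat sat", "the cat ran", "a dog sat"]
counts = count_ngrams(corpus, max_order=3)
total_tokens = sum(c for gram, c in counts.items() if len(gram) == 1)
score_seen = stupid_backoff("sat", ("the", "cat"), counts, total_tokens)
score_backed_off = stupid_backoff("sat", ("a", "cat"), counts, total_tokens)
```

Because the scores depend only on counts of n-grams and their prefixes, the expensive part of training reduces to counting, which is why the approach scales so readily.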

6.4.3 WORD ALIGNMENT

Word alignments, which are necessary for building phrase-based translation models (as well as many

other more sophisticated translation models), can be learned automatically using EM. In this section,

we introduce a popular alignment model based on HMMs.

In the statistical model of word alignment considered here, the observable variables are the

words in the source and target sentences (conventionally written using the variables f and e, respec-

tively), and their alignment is the latent variable. To make this model tractable, it is assumed that

words are translated independently of one another, which means that the model’s parameters include

the probability of any word in the source language translating to any word in the target language.

While this independence assumption is problematic in many ways, it results in a simple model

structure that admits efﬁcient inference yet produces reasonable alignments. Alignment models that

make this assumption generate a string e in the target language by selecting words in the source

language according to a lexical translation distribution. The indices of the words in f used to generate

each word in e are stored in an alignment variable, a.

14

This means that the variable $a_i$ indicates the

source word position of the $i$-th target word generated, and $|a| = |e|$. Using these assumptions, the

probability of an alignment and translation can be written as follows:

$$\Pr(e, a \mid f) = \underbrace{\Pr(a \mid f, e)}_{\text{Alignment probability}} \times \underbrace{\prod_{i=1}^{|e|} \Pr(e_i \mid f_{a_i})}_{\text{Lexical probability}}$$

14

In the original presentation of statistical lexical translation models, a special null word is added to the source sentences, which

permits words to be inserted ‘out of nowhere’. Since this does not change any of the important details of training, we omit it from

our presentation for simplicity.

Since we have parallel corpora consisting of only ⟨f, e⟩ pairs, we can learn the parameters for this

model using EM and treating a as a latent variable. However, to combat data sparsity in the alignment

probability, we must make some further simplifying assumptions. By letting the probability of an

alignment depend only on the position of the previous aligned word we capture a valuable insight

(namely, words that are nearby in the source language will tend to be nearby in the target language),

and our model acquires the structure of an HMM [150]:

$$\Pr(e, a \mid f) = \underbrace{\prod_{i=1}^{|e|} \Pr(a_i \mid a_{i-1})}_{\text{Transition probability}} \times \underbrace{\prod_{i=1}^{|e|} \Pr(e_i \mid f_{a_i})}_{\text{Emission probability}}$$

This model can be trained using the forward-backward algorithm described in the previous section,

summing over all settings of a, and the best alignment for a sentence pair can be found using the

Viterbi algorithm.

To properly initialize this HMM, it is conventional to further simplify the alignment proba-

bility model, and use this simpler model to learn initial lexical translation (emission) parameters for

the HMM. The favored simpliﬁcation is to assert that all alignments are uniformly probable:

$$\Pr(e, a \mid f) = \frac{1}{|f|^{|e|}} \times \prod_{i=1}^{|e|} \Pr(e_i \mid f_{a_i})$$

This model is known as IBM Model 1. It is attractive for initialization because it is convex everywhere,

and therefore EM will learn the same solution regardless of initialization. Finally, while the forward-

backward algorithm could be used to compute the expected counts necessary for training this model

by setting $A_q(r)$ to be a constant value for all $q$ and $r$, the uniformity assumption means that the

expected emission counts can be estimated in time $O(|e| \cdot |f|)$, rather than time $O(|e| \cdot |f|^2)$ required

by the forward-backward algorithm.
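The $O(|e| \cdot |f|)$ computation of expected counts can be sketched in a few lines of Python. This is a minimal single-machine illustration, assuming a translation table stored as a dictionary keyed by (target word, source word); the function names and the two-sentence German-English corpus are hypothetical, and the null word is omitted as in the presentation above. In a MapReduce setting, the loop over sentence pairs would correspond to the mappers and the final normalization to the reducers.

```python
from collections import defaultdict


def uniform_table(sentence_pairs):
    """Initialize Pr(e_word | f_word) uniformly over the target vocabulary."""
    e_vocab = {e for _, e_words in sentence_pairs for e in e_words}
    pairs = {(e, f) for f_words, e_words in sentence_pairs
             for f in f_words for e in e_words}
    return {pair: 1.0 / len(e_vocab) for pair in pairs}


def model1_em_iteration(sentence_pairs, t):
    """Run one EM iteration of IBM Model 1 and return the new table.

    sentence_pairs: list of (f_words, e_words) pairs.
    t: dict mapping (e_word, f_word) to the current Pr(e_word | f_word).
    """
    counts = defaultdict(float)  # expected count of (e_word, f_word)
    totals = defaultdict(float)  # expected count of f_word
    for f_words, e_words in sentence_pairs:
        for e in e_words:
            # E-step: with uniform alignments, the posterior that f generated e
            # is t(e|f) normalized over the source words in this sentence.
            z = sum(t[(e, f)] for f in f_words)
            for f in f_words:
                posterior = t[(e, f)] / z
                counts[(e, f)] += posterior
                totals[f] += posterior
    # M-step: renormalize expected counts for each source word.
    return {(e, f): c / totals[f] for (e, f), c in counts.items()}


# Hypothetical toy corpus: (source sentence f, target sentence e).
pairs = [("das haus".split(), "the house".split()),
         ("das buch".split(), "the book".split())]
table = uniform_table(pairs)
for _ in range(2):
    table = model1_em_iteration(pairs, table)
```

Each target word requires one pass over the source sentence to compute the normalizer and one to accumulate counts, so the work per sentence pair is proportional to $|e| \cdot |f|$.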

6.4.4 EXPERIMENTS

How well does a MapReduce word aligner for statistical machine translation perform? We describe

previously published results [54] that compared a Java-based Hadoop implementation against a

highly optimized word aligner called Giza++ [112], which was written in C++ and designed to run

efﬁciently on a single core. We compared the training time of Giza++ and our aligner on a Hadoop

cluster with 19 slave nodes, each with two single-core processors and two disks (38 cores total).

Figure 6.12 shows the performance of Giza++ in terms of the running time of a single EM

iteration for both Model 1 and the HMM alignment model as a function of the number of training

pairs. Both axes in the figure are on a log scale, but the ticks on the y-axis are aligned with ‘meaningful’

time intervals rather than exact orders of magnitude. There are three things to note. First, the running

time scales linearly with the size of the training data. Second, the HMM is a constant factor slower

than Model 1. Third, the alignment process is quite slow as the size of the training data grows—


at one million sentences, a single iteration takes over three hours to complete! Five iterations are

generally necessary to train the models, which means that full training takes the better part of a day.

In Figure 6.13 we plot the running time of our MapReduce implementation running on the

38-core cluster described above. For reference, we plot points indicating what 1/38 of the running

time of the Giza++ iterations would be at each data size, which gives a rough indication of what an

‘ideal’ parallelization could achieve, assuming that there was no overhead associated with distributing

computation across these machines. Three things may be observed in the results. First, as the amount

of data increases, the relative cost of the overhead associated with distributing data, marshaling and

aggregating counts, decreases. At one million sentence pairs of training data, the HMM alignment

iterations begin to approach optimal runtime efﬁciency. Second, Model 1, which we observe is

light on computation, does not approach the theoretical performance of an ideal parallelization,

and in fact, has almost the same running time as the HMM alignment algorithm. We conclude

that the overhead associated with distributing and aggregating data is signiﬁcant compared to the

Model 1 computations, although a comparison with Figure 6.12 indicates that the MapReduce

implementation is still substantially faster than the single core implementation, at least once a

certain training data size is reached. Finally, we note that, in comparison to the running times of

the single-core implementation, at large data sizes, there is a signiﬁcant advantage to using the

distributed implementation, even of Model 1.

Although these results do confound several variables (Java vs. C++ performance, memory usage

patterns), it is reasonable to expect that the confounds would tend to make the single-core system’s

performance appear relatively better than the MapReduce system (which is, of course, the opposite

pattern from what we actually observe). Furthermore, these results show that when computation is

distributed over a cluster of many machines, even an unsophisticated implementation of the HMM

aligner could compete favorably with a highly optimized single-core system whose performance is

well-known to many people in the MT research community.

Why are these results important? Perhaps the most signiﬁcant reason is that the quantity of

parallel data that is available to train statistical machine translation models is ever increasing, and as is

the case with so many problems we have encountered, more data lead to improvements in translation

quality [54]. Recently, a corpus of one billion words of French-English data was mined automatically

from the web and released publicly [33].

15

Single-core solutions to model construction simply

cannot keep pace with the amount of translated data that is constantly being produced. Fortunately,

several independent researchers have shown that existing modeling algorithms can be expressed

naturally and effectively using MapReduce, which means that we can take advantage of this data.

Furthermore, the results presented here show that even at data sizes that may be tractable on single

machines, signiﬁcant performance improvements are attainable using MapReduce implementations.

This improvement reduces experimental turnaround times, which allows researchers to more quickly

explore the solution space—which will, we hope, lead to rapid new developments in statistical

machine translation.

15

http://www.statmt.org/wmt10/translation-task.html


[Figure 6.12 plot: average iteration latency (seconds, log scale, ticks from 3 s to 3 hrs) versus corpus size (sentences, log scale, 10,000 to 1,000,000); series: Model 1, HMM.]

Figure 6.12: Running times of Giza++ (baseline single-core system) for Model 1 and HMM training

iterations at various corpus sizes.

[Figure 6.13 plot: time (seconds, log scale, ticks from 3 s to 3 hrs) versus corpus size (sentences, log scale, 10,000 to 1,000,000); series: Optimal Model 1 (Giza/38), Optimal HMM (Giza/38), MapReduce Model 1 (38 M/R), MapReduce HMM (38 M/R).]

Figure 6.13: Running times of our MapReduce implementation of Model 1 and HMM training iterations

at various corpus sizes. For reference, 1/38 running times of the Giza++ models are shown.


For the reader interested in statistical machine translation, there is an open source Hadoop-

based MapReduce implementation of a training pipeline for phrase-based translation that includes

word alignment, phrase extraction, and phrase scoring [56].

6.5 EM-LIKE ALGORITHMS

This chapter has focused on expectation maximization algorithms and their implementation in the

MapReduce programming framework. These important algorithms are indispensable for learning

models with latent structure from unannotated data, and they can be implemented quite naturally

in MapReduce. We now explore some related learning algorithms that are similar to EM but can be

used to solve more general problems, and discuss their implementation.

In this section we focus on gradient-based optimization, which refers to a class of techniques

used to optimize any objective function, provided it is differentiable with respect to the parameters

being optimized. Gradient-based optimization is particularly useful in the learning of maximum

entropy (maxent) models [110] and conditional random ﬁelds (CRF) [87] that have an exponential

form and are trained to maximize conditional likelihood. In addition to being widely used supervised

classiﬁcation models in text processing (meaning that during training, both the data and their

annotations must be observable), their gradients take the form of expectations. As a result, some of

the previously introduced techniques are also applicable for optimizing these models.

6.5.1 GRADIENT-BASED OPTIMIZATION AND LOG-LINEAR MODELS

Gradient-based optimization refers to a class of iterative optimization algorithms that use the deriva-

tives of a function to ﬁnd the parameters that yield a minimal or maximal value of that function.

Obviously, these algorithms are only applicable in cases where a useful objective exists, is differen-

tiable, and its derivatives can be efﬁciently evaluated. Fortunately, this is the case for many important

problems of interest in text processing. For the purposes of this discussion, we will give examples in

terms of minimizing functions.

Assume that we have some real-valued function F(θ) where θ is a k-dimensional vector and

that F is differentiable with respect to θ. Its gradient is deﬁned as:

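$$\nabla F(\theta) = \left\langle \frac{\partial F}{\partial \theta_1}(\theta), \frac{\partial F}{\partial \theta_2}(\theta), \ldots, \frac{\partial F}{\partial \theta_k}(\theta) \right\rangle$$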

The gradient has two crucial properties that are exploited in gradient-based optimization. First, the

gradient ∇F is a vector ﬁeld that points in the direction of the greatest increase of F and whose

magnitude indicates the rate of increase. Second, if $\theta^*$ is a (local) minimum of $F$, then the following

is true:

$$\nabla F(\theta^*) = 0$$


An extremely simple gradient-based minimization algorithm produces a series of parameter

estimates $\theta^{(1)}, \theta^{(2)}, \ldots$ by starting with some initial parameter settings $\theta^{(1)}$ and updating parameters

through successive iterations according to the following rule:

$$\theta^{(i+1)} = \theta^{(i)} - \eta^{(i)} \nabla F(\theta^{(i)})$$ (6.12)

The parameter $\eta^{(i)} > 0$ is a learning rate which indicates how quickly the algorithm moves along

the gradient during iteration i. Provided this value is small enough that F decreases, this strategy

will ﬁnd a local minimum of F. However, while simple, this update strategy may converge slowly,

and proper selection of η is non-trivial. More sophisticated algorithms perform updates that are

informed by approximations of the second derivative, which are estimated by successive evaluations

of ∇F(θ), and can converge much more rapidly [96].
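The update rule of Equation 6.12 can be sketched directly in Python. This is a minimal illustration that assumes a constant learning rate and a fixed number of iterations; the function names and the quadratic example objective are hypothetical and chosen only because its minimum is known.

```python
def gradient_descent(grad, theta, learning_rate=0.1, iterations=100):
    """Repeatedly apply theta <- theta - eta * grad(theta) with a constant eta."""
    for _ in range(iterations):
        g = grad(theta)
        theta = [t - learning_rate * g_j for t, g_j in zip(theta, g)]
    return theta


def example_gradient(theta):
    """Gradient of the hypothetical objective
    F(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2, which is minimized at (3, -1)."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


theta_star = gradient_descent(example_gradient, [0.0, 0.0])
```

For this objective each step shrinks the distance to the minimum by a constant factor, so the iterates approach (3, -1). With a learning rate that is too large the same loop would diverge, which illustrates why choosing η is non-trivial.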

Gradient-based optimization in MapReduce. Gradient-based optimization algorithms can often

be implemented effectively in MapReduce. Like EM, where the structure of the model determines

the specifics of the realization, the details of the function being optimized determine how

it should best be implemented, and not every function optimization problem will be a good ﬁt for

MapReduce. Nevertheless, MapReduce implementations of gradient-based optimization tend to

have the following characteristics:

• Each optimization iteration is one MapReduce job.

• The objective should decompose linearly across training instances. This implies that the gradient

also decomposes linearly, and therefore mappers can process input data in parallel. The

values they emit are pairs ⟨F(θ), ∇F(θ)⟩, which are linear components of the objective and

gradient.

• Evaluations of the function and its gradient are often computationally expensive because they

require processing lots of data. This makes parallelization with MapReduce worthwhile.

• Whether more than one reducer can run in parallel depends on the specific optimization

algorithm being used. Some, like the trivial algorithm of Equation 6.12, treat the dimensions

of θ independently, whereas many are sensitive to global properties of ∇F(θ). In the latter

case, parallelization across multiple reducers is non-trivial.

• Reducer(s) sum the component objective/gradient pairs, compute the total objective and gradient,

run the optimization algorithm, and emit $\theta^{(i+1)}$.

• Many optimization algorithms are stateful and must persist their state between optimization

iterations. This may either be emitted together with $\theta^{(i+1)}$ or written to the distributed file

system as a side effect of the reducer. Such external side effects must be handled carefully; refer

to Section 2.2 for a discussion.
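The characteristics listed above can be sketched as a single-process simulation in Python. This is an illustration under stated assumptions, not Hadoop code: `mapper`, `reducer`, and `train` are hypothetical names, the objective is a least-squares loss chosen because it decomposes across instances, a single reducer applies the update of Equation 6.12, and the list comprehension over splits stands in for mappers running in parallel.

```python
def mapper(theta, instances):
    """Emit the partial objective and partial gradient for one input split."""
    objective = 0.0
    gradient = [0.0] * len(theta)
    for x, y in instances:
        error = sum(t * x_j for t, x_j in zip(theta, x)) - y
        objective += 0.5 * error * error
        for j, x_j in enumerate(x):
            gradient[j] += error * x_j
    return objective, gradient


def reducer(theta, partials, learning_rate):
    """Sum the partial pairs, then apply one gradient step to the parameters."""
    total_objective = sum(obj for obj, _ in partials)
    total_gradient = [sum(g[j] for _, g in partials) for j in range(len(theta))]
    new_theta = [t - learning_rate * g for t, g in zip(theta, total_gradient)]
    return total_objective, new_theta


def train(splits, theta, learning_rate=0.05, iterations=500):
    """Driver loop: one simulated MapReduce job per optimization iteration."""
    for _ in range(iterations):
        partials = [mapper(theta, split) for split in splits]  # parallel in practice
        _, theta = reducer(theta, partials, learning_rate)
    return theta


# Hypothetical data drawn from y = 1 + 2x, with a constant bias feature.
splits = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)],
]
theta = train(splits, [0.0, 0.0])
```

Note that the current parameters must be made available to every mapper at the start of each iteration, and that the driver, not the framework, is responsible for deciding when to stop.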


Parameter learning for log-linear models. Gradient-based optimization techniques can be quite

effectively used to learn the parameters of probabilistic models with a log-linear parameterization

[100]. While a comprehensive introduction to these models is beyond the scope of this book, such

models are used extensively in text processing applications, and their training using gradient-based

optimization, which may otherwise be computationally expensive, can be implemented effectively

using MapReduce. We therefore include a brief summary.

Log-linear models are particularly useful for supervised learning (unlike the unsupervised

models learned with EM), where an annotation y ∈ Y is available for every x ∈ X in the training

data. In this case, it is possible to directly model the conditional distribution of label given input:

$$\Pr(y \mid x; \theta) = \frac{\exp \sum_i \theta_i \cdot H_i(x, y)}{\sum_{y'} \exp \sum_i \theta_i \cdot H_i(x, y')}$$

In this expression, $H_i$ are real-valued functions sensitive to features of the input and labeling. The

parameters of the model are selected so as to minimize the negative conditional log likelihood of a

set of training instances $\langle x, y \rangle_1, \langle x, y \rangle_2, \ldots$, which we assume to be i.i.d.:

$$F(\theta) = \sum_{\langle x, y \rangle} -\log \Pr(y \mid x; \theta)$$ (6.13)

$$\theta^* = \arg\min_{\theta} F(\theta)$$ (6.14)

As Equation 6.13 makes clear, the objective decomposes linearly across training instances, meaning

it can be optimized quite well in MapReduce. The derivative of $F$ with respect to $\theta_i$ can

be shown to have the following form [141]:

16

$$\frac{\partial F}{\partial \theta_i}(\theta) = -\sum_{\langle x, y \rangle} \left[ H_i(x, y) - E_{\Pr(y' \mid x; \theta)}[H_i(x, y')] \right]$$

The expectation in the second part of the gradient’s expression can be computed using a variety

of techniques. However, as we saw with EM, when very large event spaces are being modeled, as

is the case with sequence labeling, enumerating all possible values $y'$ can become computationally

intractable. And, as was the case with HMMs, independence assumptions can be used to enable

efficient computation using dynamic programming. In fact, the forward-backward algorithm introduced

in Section 6.2.4 can, with only minimal modification, be used to compute the expectation

$E_{\Pr(y' \mid x; \theta)}[H_i(x, y')]$ needed in CRF sequence models, as long as the feature functions respect the

same Markov assumption that is made in HMMs. For more information about inference in CRFs

using the forward-backward algorithm, we refer the reader to Sha et al. [140].

As we saw in the previous section, MapReduce offers signiﬁcant speedups when training

iterations require running the forward-backward algorithm. The same pattern of results holds when

training linear CRFs.
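For a small label set, where the expectation can be computed by enumerating all labels, the objective of Equation 6.13 and its gradient can be sketched as follows. This is a minimal illustration, not a CRF implementation: the label set, the two feature functions, and the training words are hypothetical. Because $F$ is the negative log likelihood, the sketch accumulates expected minus observed feature values for each training instance.

```python
import math

LABELS = ["NOUN", "VERB"]  # hypothetical label set


def features(x, y):
    """Hypothetical feature functions H_1 and H_2 for a word x and label y."""
    return [
        1.0 if x.endswith("ing") and y == "VERB" else 0.0,
        1.0 if y == "NOUN" else 0.0,
    ]


def conditional_probs(theta, x):
    """Pr(y | x; theta) for every label, using a numerically stable softmax."""
    scores = [sum(t * h for t, h in zip(theta, features(x, y))) for y in LABELS]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def nll_and_gradient(theta, data):
    """Negative conditional log likelihood and its gradient over the data."""
    objective = 0.0
    gradient = [0.0] * len(theta)
    for x, y in data:
        probs = conditional_probs(theta, x)
        objective -= math.log(probs[LABELS.index(y)])
        for y_prime, p in zip(LABELS, probs):
            for i, h in enumerate(features(x, y_prime)):
                gradient[i] += p * h  # expected feature value under the model
        for i, h in enumerate(features(x, y)):
            gradient[i] -= h  # observed feature value
    return objective, gradient


# Hypothetical training instances.
data = [("running", "VERB"), ("dog", "NOUN"), ("king", "NOUN")]
objective, gradient = nll_and_gradient([0.3, -0.2], data)
```

Both return values are sums over training instances, so a mapper could compute them for its input split and a reducer could add the partial results, exactly as in the general recipe above.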

16

This assumes that when ⟨x, y⟩ is present the model is fully observed (i.e., there are no additional latent variables).


6.6 SUMMARY AND ADDITIONAL READINGS

This chapter focused on learning the parameters of statistical models from data, using expectation

maximization algorithms or gradient-based optimization techniques. We focused especially on EM

algorithms for three reasons. First, these algorithms can be expressed naturally in the MapReduce

programming model, making them a good example of how to express a commonly used algorithm in

this new framework. Second, many models, such as the widely used hidden Markov model (HMM)

trained using EM, make independence assumptions that permit a high degree of parallelism in both

the E- and M-steps. Thus, they are particularly well positioned to take advantage of large clusters.

Finally, EM algorithms are unsupervised learning algorithms, which means that they have access to

far more training data than comparable supervised approaches. This is quite important. In Chapter 1,

when we hailed large data as the “rising tide that lifts all boats” to yield more effective algorithms,

we were mostly referring to unsupervised approaches, given that the manual effort required to

generate annotated data remains a bottleneck in many supervised approaches. Data acquisition for

unsupervised algorithms is often as simple as crawling speciﬁc web sources, given the enormous

quantities of data available “for free”. This, combined with the ability of MapReduce to process

large datasets in parallel, provides researchers with an effective strategy for developing increasingly

effective applications.

Since EM algorithms are relatively computationally expensive, even for small amounts of

data, we were led to consider how related supervised learning models (which typically have much less

training data available) can also be implemented in MapReduce. The discussion demonstrates that

not only does MapReduce provide a means for coping with ever-increasing amounts of data, but it is

also useful for parallelizing expensive computations. Although MapReduce has been designed with

mostly data-intensive applications in mind, the ability to leverage clusters of commodity hardware

to parallelize computationally expensive algorithms is an important use case.

Additional Readings. Because of its ability to leverage large amounts of training data, machine

learning is an attractive problem for MapReduce and an area of active research. Chu et al. [37]

presented general formulations of a variety of machine learning problems, focusing on a normal

form for expressing a variety of machine learning algorithms in MapReduce. The Apache Mahout

project is an open-source implementation of these and other learning algorithms,

17

and it is also

the subject of a forthcoming book [116]. Issues associated with a MapReduce implementation of

latent Dirichlet allocation (LDA), which is another important unsupervised learning technique, with

certain similarities to EM, have been explored by Wang et al. [151].

17

http://lucene.apache.org/mahout/


CHAPTER 7

Closing Remarks

The need to process enormous quantities of data has never been greater. Not only are terabyte-

and petabyte-scale datasets rapidly becoming commonplace, but there is consensus that great value

lies buried in them, waiting to be unlocked by the right computational tools. In the commercial

sphere, business intelligence—driven by the ability to gather data from a dizzying array of sources—

promises to help organizations better understand their customers and the marketplace, hopefully

leading to better business decisions and competitive advantages. For engineers building information

processing tools and applications, larger datasets lead to more effective algorithms for a wide range

of tasks, from machine translation to spam detection. In the natural and physical sciences, the ability

to analyze massive amounts of data may provide the key to unlocking the secrets of the cosmos or

the mysteries of life.

In the preceding chapters, we have shown how MapReduce can be exploited to solve a variety

of problems related to text processing at scales that would have been unthinkable a few years ago.

However, no tool, no matter how powerful or flexible, can be perfectly adapted to every task, so it is

only fair to discuss the limitations of the MapReduce programming model and survey alternatives.

Section 7.1 covers online learning algorithms and Monte Carlo simulations, which are examples of

algorithms that require maintaining global state. As we have seen, this is difﬁcult to accomplish

in MapReduce. Section 7.2 discusses alternative programming models, and the book concludes in

Section 7.3.

7.1 LIMITATIONS OF MAPREDUCE

As we have seen throughout this book, solutions to many interesting problems in text processing

do not require global synchronization. As a result, they can be expressed naturally in MapReduce,

since map and reduce tasks run independently and in isolation. However, there are many examples

of algorithms that depend crucially on the existence of shared global state during processing, making

them difﬁcult to implement in MapReduce (since the single opportunity for global synchronization

in MapReduce is the barrier between the map and reduce phases of processing).

The ﬁrst example is online learning. Recall from Chapter 6 the concept of learning as the

setting of parameters in a statistical model. Both EM and the gradient-based learning algorithms

we described are instances of what are known as batch learning algorithms. This simply means that

the full “batch” of training data is processed before any updates to the model parameters are made.

On one hand, this is quite reasonable: updates are not made until the full evidence of the training

data has been weighed against the model. An earlier update would seem, in some sense, to be hasty.


However, it is generally the case that more frequent updates can lead to more rapid convergence

of the model (in terms of number of training instances processed), even if those updates are made

by considering less data [24]. Thinking in terms of gradient optimization (see Section 6.5), online

learning algorithms can be understood as computing an approximation of the true gradient, using

only a few training instances. Although only an approximation, the gradient computed from a small

subset of training instances is often quite reasonable, and the aggregate behavior of multiple updates

tends to even out errors that are made. In the limit, updates can be made after every training instance.
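The contrast between batch and online updates can be sketched in Python. This is a minimal single-machine illustration under stated assumptions: the function names are hypothetical, the learning rate is constant, and the example problem (fitting one parameter to a handful of numbers under squared error) is chosen only for simplicity.

```python
import random


def batch_update(theta, data, grad_one, learning_rate):
    """Batch learning: weigh the full training set, then update once."""
    total = [0.0] * len(theta)
    for instance in data:
        g = grad_one(theta, instance)
        total = [a + b for a, b in zip(total, g)]
    return [t - learning_rate * g for t, g in zip(theta, total)]


def online_pass(theta, data, grad_one, learning_rate, rng):
    """Online learning: update after every training instance in one pass."""
    order = list(data)
    rng.shuffle(order)
    for instance in order:
        g = grad_one(theta, instance)  # approximate gradient from one instance
        theta = [t - learning_rate * g_j for t, g_j in zip(theta, g)]
    return theta


def squared_error_gradient(theta, x):
    """Gradient of 0.5 * (theta - x)^2 for a single observation x."""
    return [theta[0] - x]


observations = [1.0, 2.0, 3.0, 4.0, 5.0]
after_batch = batch_update([0.0], observations, squared_error_gradient, 0.1)
rng = random.Random(0)
after_online = [0.0]
for _ in range(50):
    after_online = online_pass(after_online, observations, squared_error_gradient, 0.1, rng)
```

In the online version the parameters change inside the loop over instances, so every gradient evaluation depends on all previous updates. This sequential dependence is precisely what makes the parameters behave as shared global state.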

Unfortunately, implementing online learning algorithms in MapReduce is problematic. The

model parameters in a learning algorithm can be viewed as shared global state, which must be

updated as the model is evaluated against training data. All processes performing the evaluation

(presumably the mappers) must have access to this state. In a batch learner, where updates occur

in one or more reducers (or, alternatively, in the driver code), synchronization of this resource is

enforced by the MapReduce framework. However, with online learning, these updates must occur

after processing smaller numbers of instances. This means that the framework must be altered to

support faster processing of smaller datasets, which goes against the design choices of most existing

MapReduce implementations. Since MapReduce was speciﬁcally optimized for batch operations

over large amounts of data, such a style of computation would likely result in inefﬁcient use of

resources. In Hadoop, for example, map and reduce tasks have considerable startup costs. This is

acceptable because in most circumstances, this cost is amortized over the processing of many key-

value pairs. However, for small datasets, these high startup costs become intolerable. An alternative

is to abandon shared global state and run independent instances of the training algorithm in parallel

(on different portions of the data). A ﬁnal solution is then arrived at by merging individual results.

Experiments, however, show that the merged solution is inferior to the output of running the training

algorithm on the entire dataset [52].

A related difﬁculty occurs when running what are called Monte Carlo simulations, which are

used to perform inference in probabilistic models where evaluating or representing the model exactly

is impossible. The basic idea is quite simple: samples are drawn from the random variables in the

model to simulate its behavior, and then simple frequency statistics are computed over the samples.

This sort of inference is particularly useful when dealing with so-called nonparametric models, which

are models whose structure is not speciﬁed in advance, but is rather inferred from training data.

For an illustration, imagine learning a hidden Markov model, but inferring the number of states,

rather than having them speciﬁed. Being able to parallelize Monte Carlo simulations would be

tremendously valuable, particularly for unsupervised learning applications where they have been

found to be far more effective than EM-based learning (which requires specifying the model).

Although recent work [10] has shown that the delays in synchronizing sample statistics due to

parallel implementations do not necessarily damage the inference, MapReduce offers no natural

mechanism for managing the global shared state that would be required for such an implementation.
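The basic idea of Monte Carlo simulation, drawing samples and computing frequency statistics over them, can be sketched in Python. This is a deliberately simple illustration with hypothetical names: it estimates the probability of an event over two dice, where each sample is independent of the others. The samplers used for nonparametric models differ in that each draw typically depends on statistics accumulated from previous draws, which is where the need for shared state arises.

```python
import random


def monte_carlo_estimate(sample, event, num_samples, rng):
    """Estimate Pr(event) as the fraction of drawn samples for which it holds."""
    hits = sum(1 for _ in range(num_samples) if event(sample(rng)))
    return hits / num_samples


def roll_two_dice(rng):
    """Draw one sample from a toy model of two independent fair dice."""
    return rng.randint(1, 6), rng.randint(1, 6)


estimate = monte_carlo_estimate(
    roll_two_dice,
    lambda dice: sum(dice) >= 10,  # event of interest
    100_000,
    random.Random(0),
)
```

The exact probability of this event is 6/36, so the estimate should lie close to 1/6, with an error that shrinks as the number of samples grows.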

The problem of global state is sufﬁciently pervasive that there has been substantial work on

solutions. One approach is to build a distributed datastore capable of maintaining the global state.


However, such a system would need to be highly scalable to be used in conjunction with MapReduce.

Google’s Bigtable [34], which is a sparse, distributed, persistent multidimensional sorted map built

on top of GFS, ﬁts the bill, and has been used in exactly this manner. Amazon’s Dynamo [48], which

is a distributed key-value store (with a very different architecture), might also be useful in this respect,

although it wasn’t originally designed with such an application in mind. Unfortunately, it is unclear

if the open-source implementations of these two systems (HBase and Cassandra, respectively) are

sufﬁciently mature to handle the low-latency and high-throughput demands of maintaining global

state in the context of massively distributed processing (but recent benchmarks are encouraging [40]).

7.2 ALTERNATIVE COMPUTING PARADIGMS

Streaming algorithms [3] represent an alternative programming model for dealing with large volumes

of data with limited computational and storage resources. This model assumes that data are presented

to the algorithm as one or more streams of inputs that are processed in order, and only once. The model

is agnostic with respect to the source of these streams, which could be ﬁles in a distributed ﬁle system,

but more interestingly, data from an “external” source or some other data gathering device. Stream

processing is very attractive for working with time-series data (news feeds, tweets, sensor readings,

etc.), which is difﬁcult in MapReduce (once again, given its batch-oriented design). Furthermore,

since streaming algorithms are comparatively simple (because there is only so much that can be

done with a particular training instance), they can often take advantage of modern GPUs, which

have a large number of (relatively simple) functional units [104]. In the context of text processing,

streaming algorithms have been applied to language modeling [90], translation modeling [89], and

detecting the first mention of a news event in a stream [121].
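As a small illustration of the streaming model, the following Python sketch maintains the mean and variance of a stream in a single pass with constant memory, using Welford's method. The class and function names are hypothetical, and the example is not drawn from any of the systems cited above; it only shows what it means to see each item in order and exactly once.

```python
class RunningStats:
    """Single-pass mean and population variance using constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def variance(self):
        """Population variance of the values seen so far."""
        return self.m2 / self.n if self.n else 0.0


def consume(stream):
    """Process a stream of numbers in order, touching each item exactly once."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats


stats = consume(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Because the state consists of three numbers regardless of the stream length, the same code works whether the input is a file or an unbounded feed.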

The idea of stream processing has been generalized in the Dryad framework as arbitrary

dataﬂow graphs [75; 159]. A Dryad job is a directed acyclic graph where each vertex represents

developer-speciﬁed computations and edges represent data channels that capture dependencies. The

dataﬂow graph is a logical computation graph that is automatically mapped onto physical resources

by the framework. At runtime, channels are used to transport partial results between vertices, and

can be realized using ﬁles, TCP pipes, or shared memory.
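
The dataflow idea can be sketched in a few lines. The toy executor below runs on a single machine and is not Dryad's API: the names run_dataflow, vertices, and inputs are assumptions for this illustration, and an in-memory dictionary stands in for the channels. It shows only the logical part of the model, namely that a vertex runs once all of its upstream vertices have produced their results.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, Sequence


def run_dataflow(
    vertices: Dict[str, Callable[..., object]],
    inputs: Dict[str, Sequence[str]],
) -> Dict[str, object]:
    """Run each vertex after all of its upstream vertices have finished.

    vertices maps a vertex name to its computation. inputs maps a vertex
    name to the ordered names of the vertices whose outputs it consumes.
    """
    order = TopologicalSorter({v: inputs.get(v, ()) for v in vertices})
    results: Dict[str, object] = {}
    for name in order.static_order():  # raises CycleError if not acyclic
        upstream = [results[dep] for dep in inputs.get(name, ())]
        results[name] = vertices[name](*upstream)
    return results


# Hypothetical job: one source, a fan-out into two counts, then a join.
vertices = {
    "read": lambda: ["a b", "b c"],
    "tokenize": lambda lines: [w for line in lines for w in line.split()],
    "count": lambda words: len(words),
    "distinct": lambda words: len(set(words)),
    "report": lambda total, unique: (total, unique),
}
inputs = {
    "tokenize": ["read"],
    "count": ["tokenize"],
    "distinct": ["tokenize"],
    "report": ["count", "distinct"],
}
results = run_dataflow(vertices, inputs)
print(results["report"])
```

In a real system the interesting work lies in what this sketch omits: placing vertices on machines, choosing a channel type for each edge, and re-running vertices after failures.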

Another system worth mentioning is Pregel [98], which implements a programming model

inspired by Valiant’s Bulk Synchronous Parallel (BSP) model [148]. Pregel was speciﬁcally designed

for large-scale graph algorithms, but unfortunately there are few published details at present. How-

ever, a longer description is anticipated in a forthcoming paper [99].
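
Because few details of Pregel itself were available, the sketch below illustrates only the general BSP pattern on which it is based, not Pregel's interface. Computation proceeds in supersteps: each vertex reads the messages sent to it in the previous superstep, updates its own state, and sends messages that are delivered only after a barrier. The function name bsp_max and the example graph are assumptions for this illustration.

```python
from typing import Dict, List


def bsp_max(graph: Dict[str, List[str]], values: Dict[str, int]) -> Dict[str, int]:
    """Propagate the largest vertex value along edges, superstep by superstep.

    graph maps every vertex to its out-neighbors; every neighbor must
    also appear as a key. The loop ends when no messages are in flight.
    """
    state = dict(values)
    # Superstep 0: every vertex sends its value to its out-neighbors.
    inbox: Dict[str, List[int]] = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(state[v])
    while any(inbox.values()):
        outbox: Dict[str, List[int]] = {v: [] for v in graph}
        # Vertices are independent within a superstep, so this loop
        # could run in parallel; here it runs sequentially.
        for v, messages in inbox.items():
            if messages and max(messages) > state[v]:
                state[v] = max(messages)
                for n in graph[v]:
                    outbox[n].append(state[v])
        inbox = outbox  # barrier: messages become visible next superstep
    return state


# Hypothetical undirected chain a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final = bsp_max(chain, {"a": 3, "b": 6, "c": 2, "d": 1})
print(final)
```

A vertex that learns nothing new sends no messages, so activity dies out on its own once the maximum has reached every vertex.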

What is the signiﬁcance of these developments? The power of MapReduce derives from

providing an abstraction that allows developers to harness the power of large clusters. As anyone

who has taken an introductory computer science course would know, abstractions manage complexity

by hiding details and presenting well-deﬁned behaviors to users of those abstractions. This process

makes certain tasks easier, but others more difﬁcult, if not impossible. MapReduce is certainly no

exception to this generalization, and one of the goals of this book has been to give the reader a

better understanding of what’s easy to do in MapReduce and what its limitations are. But of course,

this raises the obvious question: What other abstractions are available in the massively distributed

datacenter environment? Are there more appropriate computational models that would allow us to

tackle classes of problems that are difﬁcult for MapReduce?

Dryad and Pregel are alternative answers to these questions. They share in providing an ab-

straction for large-scale distributed computations, separating the what from the how of computation

and isolating the developer from the details of concurrent programming. They differ, however, in how

distributed computations are conceptualized: functional-style programming, arbitrary dataﬂows, or

BSP. These conceptions represent different trade-offs between simplicity and expressivity: for example,

Dryad is more flexible than MapReduce, and in fact, MapReduce can be trivially implemented

in Dryad. However, it remains unclear, at least at present, which approach is more appropriate for

different classes of applications. Looking forward, we can certainly expect the development of new

models and a better understanding of existing ones. MapReduce is not the end, and perhaps not

even the best. It is merely the first of many approaches to harness large-scale distributed computing

resources.

Even within the Hadoop/MapReduce ecosystem, we have already observed the development

of alternative approaches for expressing distributed computations. For example, there is a proposal to

add a third merge phase after map and reduce to better support relational operations [36]. Pig [114],

which was inspired by Google’s Sawzall [122], can be described as a data analytics platform that

provides a lightweight scripting language for manipulating large datasets. Although Pig scripts (in a

language called Pig Latin) are ultimately converted into Hadoop jobs by Pig’s execution engine, con-

structs in the language allow developers to specify data transformations (ﬁltering, joining, grouping,

etc.) at a much higher level. Similarly, Hive [68], another open-source project, provides an abstraction

on top of Hadoop that allows users to issue SQL queries against large relational datasets stored in

HDFS. Hive queries (in HiveQL) are “compiled down” to Hadoop jobs by the Hive query engine. Therefore,

the system provides a data analysis tool for users who are already comfortable with relational

databases, while simultaneously taking advantage of Hadoop’s data processing capabilities.
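
To make the idea of compiling a higher-level operation into a MapReduce job more concrete, the sketch below expresses a grouping aggregate, roughly the equivalent of counting rows per key, as a map step, a shuffle, and a reduce step. It is a single-machine illustration, not the plan that Pig or Hive would actually generate; the function group_count and the record layout are assumptions for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, str]


def group_count(records: Iterable[Record], key_field: str) -> Dict[str, int]:
    """Count records per value of key_field using map, shuffle, reduce."""

    def mapper(record: Record) -> Iterator[Tuple[str, int]]:
        # Emit one (key, 1) pair per input record.
        yield record[key_field], 1

    def reducer(key: str, values: List[int]) -> Tuple[str, int]:
        # Sum the partial counts that share a key.
        return key, sum(values)

    # Shuffle: bring together all values emitted under the same key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)

    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical access log with one record per page view.
log = [
    {"user": "ann", "url": "/a"},
    {"user": "bob", "url": "/b"},
    {"user": "ann", "url": "/c"},
]
views_per_user = group_count(log, "user")
print(views_per_user)
```

The caller states only what to group by, and the mapper and reducer are derived from that request. This is the kind of mechanical translation that lets a scripting or query language sit on top of MapReduce.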

7.3 MAPREDUCE AND BEYOND

The capabilities necessary to tackle large-data problems are already within reach of many and will

continue to become more accessible over time. By scaling “out” with commodity servers, we have

been able to economically bring large clusters of machines to bear on problems of interest. But

this has only been possible with corresponding innovations in software and how computations are

organized on a massive scale. Important ideas include moving processing to the data, as opposed to

the other way around, and emphasizing throughput over latency for batch tasks by scanning data

sequentially and avoiding random seeks. Most important of all, however, is the development of new

abstractions that hide system-level details from the application developer. These abstractions are

at the level of entire datacenters, and provide a model with which programmers can reason about

computations at a massive scale without being distracted by ﬁne-grained concurrency management,

fault tolerance, error recovery, and a host of other issues in distributed computing. This, in turn,

paves the way for innovations in scalable algorithms that can run on petabyte-scale datasets.

None of these points are new or particularly earth-shattering; computer scientists have known

about these principles for decades. However, MapReduce is unique in that, for the ﬁrst time, all these

ideas came together and were demonstrated on practical problems at scales unseen before, both in

terms of computational resources and the impact on the daily lives of millions. The engineers at

Google deserve a tremendous amount of credit for that, and also for sharing their insights with

the rest of the world. Furthermore, the engineers and executives at Yahoo deserve a lot of credit

for starting the open-source Hadoop project, which has made MapReduce accessible to everyone

and created the vibrant software ecosystem that flourishes today. Add to that the advent of utility

computing, which eliminates capital investments associated with cluster infrastructure, and large-data

processing capabilities are now available “to the masses” with a relatively low barrier to entry.

The golden age of massively distributed computing is ﬁnally upon us.

Bibliography

[1] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander

Rasin. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for

analytical workloads. In Proceedings of the 35th International Conference on Very Large Data

Bases (VLDB 2009), pages 922–933, Lyon, France, 2009. 59

[2] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews

of Modern Physics, 74:47–97, 2002. DOI: 10.1103/RevModPhys.74.47 92

[3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the

frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing

(STOC ’96), pages 20–29, Philadelphia, Pennsylvania, 1996. DOI: 10.1145/237814.237823

145

[4] Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Hellerstein, and

Russell C. Sears. BOOM: Data-centric programming in the datacenter. Technical Report

UCB/EECS-2009-98, Electrical Engineering and Computer Sciences, University of Cali-

fornia at Berkeley, 2009. 32

[5] Gene Amdahl. Validity of the single processor approach to achieving large-scale computing

capabilities. In Proceedings of the AFIPS Spring Joint Computer Conference, pages 483–485,

1967. DOI: 10.1145/1465482.1465560 17

[6] Rajagopal Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu Pucha, Prasenjit

Sarkar, Mansi Shah, and Renu Tewari. Cloud analytics: Do we really need to reinvent the

storage stack? In Proceedings of the 2009 Workshop on Hot Topics in Cloud Computing (HotCloud

09), San Diego, California, 2009. 29

[7] Thomas Anderson, Michael Dahlin, Jeanna Neefe, David Patterson, Drew Roselli, and Ran-

dolph Wang. Serverless network ﬁle systems. In Proceedings of the 15th ACM Symposium on

Operating Systems Principles (SOSP 1995), pages 109–126, Copper Mountain Resort, Col-

orado, 1995. DOI: 10.1145/224056.224066 29

[8] Vo Ngoc Anh and Alistair Moffat. Inverted index compression using word-aligned binary

codes. Information Retrieval, 8(1):151–166, 2005.

DOI: 10.1023/B:INRT.0000048490.99518.5c 76

[9] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew

Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia.

Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS-2009-28,

Electrical Engineering and Computer Sciences, University of California at Berkeley, 2009.

6

[10] Arthur Asuncion, Padhraic Smyth, and Max Welling. Asynchronous distributed learning of

topic models. In Advances in Neural Information Processing Systems 21 (NIPS 2008), pages

81–88, Vancouver, British Columbia, Canada, 2008. 144

[11] Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, and Fabrizio

Silvestri. Challenges on distributed web retrieval. In Proceedings of the IEEE 23rd Inter-

national Conference on Data Engineering (ICDE 2007), pages 6–20, Istanbul, Turkey, 2007.

DOI: 10.1109/ICDE.2007.367846 83

[12] Ricardo Baeza-Yates, Carlos Castillo, and Vicente López. PageRank increase under dif-

ferent collusion topologies. In Proceedings of the First International Workshop on Adversarial

Information Retrieval on the Web (AIRWeb 2005), pages 17–24, Chiba, Japan, 2005. 96

[13] Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Pla-

chouras, and Fabrizio Silvestri. The impact of caching on search engines. In Proceedings

of the 30th Annual International ACM SIGIR Conference on Research and Development in

Information Retrieval (SIGIR 2007), pages 183–190, Amsterdam, The Netherlands, 2007.

DOI: 10.1145/1277741.1277775 82

[14] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disam-

biguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Lin-

guistics (ACL 2001), pages 26–33, Toulouse, France, 2001. DOI: 10.3115/1073012.1073017

4

[15] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neuge-

bauer, Ian Pratt, and Andrew Warﬁeld. Xen and the art of virtualization. In Proceedings of the

19th ACM Symposium on Operating Systems Principles (SOSP 2003), pages 164–177, Bolton

Landing, New York, 2003. DOI: 10.1145/945445.945462 6

[16] Luiz André Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: The Google

cluster architecture. IEEE Micro, 23(2):22–28, 2003. DOI: 10.1109/MM.2003.1196112 82

[17] Luiz André Barroso and Urs Hölzle. The case for energy-proportional computing. Computer,

40(12):33–37, 2007. DOI: 10.1109/MC.2007.443 9

[18] Luiz André Barroso and Urs Hölzle. The Datacenter as a Computer: An Introduc-

tion to the Design of Warehouse-Scale Machines. Morgan & Claypool Publishers, 2009.

DOI: 10.2200/S00193ED1V01Y200905CAC006 8, 9, 10, 14

[19] Jacek Becla, Andrew Hanushevsky, Sergei Nikolaev, Ghaleb Abdulla, Alex Szalay, Maria

Nieto-Santisteban, Ani Thakar, and Jim Gray. Designing a multi-petabyte database for

LSST. SLAC Publications SLAC-PUB-12292, Stanford Linear Accelerator Center, May

2006. 2

[20] Jacek Becla and Daniel L. Wang. Lessons learned from managing a petabyte. In Proceedings

of the Second Biennial Conference on Innovative Data Systems Research (CIDR 2005), Asilomar,

California, 2005. 2

[21] Gordon Bell, Tony Hey, and Alex Szalay. Beyond the data deluge. Science, 323(5919):1297–

1298, 2009. 2

[22] Monica Bianchini, Marco Gori, and Franco Scarselli. Inside PageRank. ACM Transactions

on Internet Technology, 5(1):92–128, 2005. DOI: 10.1145/1052934.1052938 98, 100

[23] Jorge Luis Borges. Collected Fictions (translated by Andrew Hurley). Penguin, 1999. 5

[24] Léon Bottou. Stochastic learning. In Olivier Bousquet and Ulrike von Luxburg, editors,

Advanced Lectures on Machine Learning, Lecture Notes in Artiﬁcial Intelligence, LNAI 3176,

pages 146–168. Springer Verlag, Berlin, 2004. DOI: 10.1007/b100712 144

[25] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language

models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods

in Natural Language Processing and Computational Natural Language Learning, pages 858–867,

Prague, Czech Republic, 2007. 4, 5, 133

[26] Thorsten Brants and Peng Xu. Distributed Language Models. Morgan & Claypool Publishers,

2010. 134

[27] Eric Brill, Jimmy Lin, Michele Banko, Susan Dumais, and Andrew Ng. Data-intensive

question answering. In Proceedings of the Tenth Text REtrieval Conference (TREC 2001),

pages 393–400, Gaithersburg, Maryland, 2001. 4

[28] Frederick P. Brooks. The Mythical Man-Month: Essays on Software Engineering, Anniversary

Edition. Addison-Wesley, Reading, Massachusetts, 1995. 12

[29] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer.

The mathematics of statistical machine translation: Parameter estimation. Computational

Linguistics, 19(2):263–311, 1993. 130

[30] Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval:

Implementing and Evaluating Search Engines. MIT Press, Cambridge, Massachusetts, 2010.

78, 83

[31] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, and Ivona Brandic.

Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering

computing as the 5th utility. Future Generation Computer Systems, 25(6):599–616, 2009.

DOI: 10.1016/j.future.2008.12.001 6

[32] Luis-Felipe Cabrera and Darrell D. E. Long. Swift: Using distributed disk striping to provide

high I/O data rates. Computer Systems, 4(4):405–436, 1991. 29

[33] Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. Findings of the

2009 workshop on statistical machine translation. In Proceedings of the Fourth Workshop on

Statistical Machine Translation (StatMT ’09), pages 1–28, Athens, Greece, 2009. 136

[34] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael

Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. Bigtable: A distributed storage

system for structured data. In Proceedings of the 7th Symposium on Operating System Design

and Implementation (OSDI 2006), pages 205–218, Seattle, Washington, 2006. 24, 145

[35] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques

for language modeling. In Proceedings of the 34th Annual Meeting of the Association

for Computational Linguistics (ACL 1996), pages 310–318, Santa Cruz, California, 1996.

DOI: 10.3115/981863.981904 5, 133

[36] Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and D. Stott Parker. Map-Reduce-Merge:

Simpliﬁed relational data processing on large clusters. In Proceedings of the 2007 ACM

SIGMOD International Conference on Management of Data, pages 1029–1040, Beijing, China,

2007. DOI: 10.1145/1247480.1247602 146

[37] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Ng, and

Kunle Olukotun. Map-Reduce for machine learning on multicore. In Advances in Neural

Information Processing Systems 19 (NIPS 2006), pages 281–288, Vancouver, British Columbia,

Canada, 2006. 141

[38] Kenneth W. Church and Patrick Hanks. Word association norms, mutual information, and

lexicography. Computational Linguistics, 16(1):22–29, 1990. 48

[39] Jonathan Cohen. Graph twiddling in a MapReduce world. Computing in Science and Engi-

neering, 11(4):29–41, 2009. DOI: 10.1109/MCSE.2009.120 103

[40] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears.

Benchmarking cloud serving systems with YCSB. In Proceedings of the First ACM Symposium

on Cloud Computing (ACM SOCC 2010), Indianapolis, Indiana, 2010. 145

[41] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms.

MIT Press, Cambridge, Massachusetts, 1990. 88

[42] W. Bruce Croft, Donald Metzler, and Trevor Strohman. Search Engines: Information Retrieval

in Practice. Addison-Wesley, Reading, Massachusetts, 2009. 83

[43] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos,

Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel

computation. ACM SIGPLAN Notices, 28(7):1–12, 1993. DOI: 10.1145/173284.155333

15

[44] Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun. A practical part-of-speech

tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, pages

133–140, Trento, Italy, 1992. DOI: 10.3115/974499.974523 114

[45] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simpliﬁed data processing on large clusters.

In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI

2004), pages 137–150, San Francisco, California, 2004. 1, 24, 25

[46] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simpliﬁed data processing on large clusters.

Communications of the ACM, 51(1):107–113, 2008. DOI: 10.1145/1327452.1327492 2

[47] Jeffrey Dean and Sanjay Ghemawat. MapReduce: A ﬂexible data processing tool. Commu-

nications of the ACM, 53(1):72–77, 2010. DOI: 10.1145/1629175.1629198 59

[48] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash

Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall, and Werner Vogels. Dy-

namo: Amazon’s highly available key-value store. In Proceedings of the 21st ACM Symposium

on Operating Systems Principles (SOSP 2007), pages 205–220, Stevenson, Washington, 2007.

145

[49] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete

data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological),

39(1):1–38, 1977. 108

[50] David J. DeWitt and Jim Gray. Parallel database systems: The future of high

performance database systems. Communications of the ACM, 35(6):85–98, 1992.

DOI: 10.1145/129888.129894 12

[51] David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker,

and David Wood. Implementation techniques for main memory database systems. ACM

SIGMOD Record, 14(2):1–8, 1984. DOI: 10.1145/971697.602261 63

[52] Mark Dredze, Alex Kulesza, and Koby Crammer. Multi-domain learning by

conﬁdence-weighted parameter combination. Machine Learning, 79:123–149, 2010.

DOI: 10.1007/s10994-009-5148-0 144

[53] Susan Dumais, Michele Banko, Eric Brill, Jimmy Lin, and Andrew Ng. Web question

answering: Is more always better? In Proceedings of the 25th Annual International ACM SIGIR

Conference on Research and Development in Information Retrieval (SIGIR 2002), pages 291–

298, Tampere, Finland, 2002. DOI: 10.1145/564376.564428 4

[54] Chris Dyer, Aaron Cordova, Alex Mont, and Jimmy Lin. Fast, easy, and cheap: Construction

of statistical machine translation models with MapReduce. In Proceedings of the Third Workshop

on Statistical Machine Translation at ACL 2008, pages 199–207, Columbus, Ohio, 2008. 47,

135, 136

[55] John R. Firth. A synopsis of linguistic theory 1930–55. In Studies in Linguistic Analysis,

Special Volume of the Philological Society, pages 1–32. Blackwell, Oxford, 1957. 48

[56] Qin Gao and Stephan Vogel. Training phrase-based machine translation models on the

cloud: Open source machine translation toolkit Chaski. The Prague Bulletin of Mathematical

Linguistics, 93:37–46, 2010. DOI: 10.2478/v10108-010-0004-8 138

[57] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In

Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003), pages

29–43, Bolton Landing, New York, 2003. DOI: 10.1145/945445.945450 29

[58] Seth Gilbert and Nancy Lynch. Brewer’s Conjecture and the feasibility of consis-

tent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002.

DOI: 10.1145/564585.564601 32

[59] Michelle Girvan and Mark E. J. Newman. Community structure in social and biological

networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.

DOI: 10.1073/pnas.122653799 85

[60] Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel

Computing. Addison-Wesley, Reading, Massachusetts, 2003. 14, 103

[61] Mark S. Granovetter. The strength of weak ties. The American Journal of Sociology, 78(6):1360–

1380, 1973. DOI: 10.1086/225469 92

[62] Mark S. Granovetter. The strength of weak ties: A network theory revisited. Sociological

Theory, 1:201–233, 1983. DOI: 10.2307/202051 92

[63] Zoltán Gyöngyi and Hector Garcia-Molina. Web spam taxonomy. In Proceedings of the First

International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005), pages

39–47, Chiba, Japan, 2005. 96

[64] Per Hage and Frank Harary. Island Networks: Communication, Kinship, and Classiﬁcation

Structures in Oceania. Cambridge University Press, Cambridge, England, 1996. 86

[65] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data.

IEEE Intelligent Systems, 24(2):8–12, 2009. DOI: 10.1109/MIS.2009.36 5

[66] James Hamilton. On designing and deploying Internet-scale services. In Proceedings of the

21st Large Installation System Administration Conference (LISA ’07), pages 233–244, Dallas,

Texas, 2007. 10

[67] James Hamilton. Cooperative Expendable Micro-Slice Servers (CEMS): Low cost, low

power servers for Internet-scale services. In Proceedings of the Fourth Biennial Conference on

Innovative Data Systems Research (CIDR 2009), Asilomar, California, 2009. 9, 10

[68] Jeff Hammerbacher. Information platforms and the rise of the data scientist. In Toby Segaran

and Jeff Hammerbacher, editors, Beautiful Data, pages 73–84. O’Reilly, Sebastopol, California,

2009. 6, 59, 146

[69] Zelig S. Harris. Mathematical Structures of Language. Wiley, New York, 1968. 48

[70] Md. Raﬁul Hassan and Baikunth Nath. Stock market forecasting using hidden Markov

models: A new approach. In Proceedings of the 5th International Conference on Intelli-

gent Systems Design and Applications (ISDA ’05), pages 192–196, Wroclaw, Poland, 2005.

DOI: 10.1109/ISDA.2005.85 114

[71] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. Mars:

A MapReduce framework on graphics processors. In Proceedings of the 17th International

Conference on Parallel Architectures and Compilation Techniques (PACT 2008), pages 260–269,

Toronto, Ontario, Canada, 2008. DOI: 10.1145/1454115.1454152 20

[72] Tony Hey, Stewart Tansley, and Kristin Tolle. The Fourth Paradigm: Data-Intensive Scientiﬁc

Discovery. Microsoft Research, Redmond, Washington, 2009. 3

[73] Tony Hey, Stewart Tansley, and Kristin Tolle. Jim Gray on eScience: A transformed scientiﬁc

method. In Tony Hey, Stewart Tansley, and Kristin Tolle, editors, The Fourth Paradigm:

Data-Intensive Scientiﬁc Discovery. Microsoft Research, Redmond, Washington, 2009. 3

[74] John Howard, Michael Kazar, Sherri Menees, David Nichols, Mahadev Satyanarayanan,

Robert Sidebotham, and Michael West. Scale and performance in a distributed ﬁle system.

ACM Transactions on Computer Systems, 6(1):51–81, 1988. DOI: 10.1145/35037.35059 29

[75] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Dis-

tributed data-parallel programs from sequential building blocks. In Proceedings of the ACM

SIGOPS/EuroSys European Conference on Computer Systems 2007 (EuroSys 2007), pages 59–72,

Lisbon, Portugal, 2007. DOI: 10.1145/1272998.1273005 145

[76] Adam Jacobs. The pathologies of big data. ACM Queue, 7(6), 2009.

DOI: 10.1145/1563821.1563874 11

[77] Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley, Reading, Massachusetts,

1992. 14, 103

[78] Frederick Jelinek. Statistical methods for speech recognition. MIT Press, Cambridge, Mas-

sachusetts, 1997. 112, 114, 120

[79] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Pearson, Upper Saddle

River, New Jersey, 2009. 5

[80] U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec.

HADI: Fast diameter estimation and mining in massive graphs with Hadoop. Technical

Report CMU-ML-08-117, School of Computer Science, Carnegie Mellon University, 2008.

103

[81] U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. PEGASUS: A peta-scale

graph mining system—implementation and observations. In Proceedings of the 2009 Ninth

IEEE International Conference on Data Mining (ICDM 2009), pages 229–238, Miami, Florida,

2009. DOI: 10.1109/ICDM.2009.14 103

[82] Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for Map-

Reduce. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms

(SODA 2010), Austin, Texas, 2010. 15

[83] Aaron Kimball, Sierra Michels-Slettvet, and Christophe Bisciglia. Cluster computing

for Web-scale data processing. In Proceedings of the 39th ACM Technical Symposium

on Computer Science Education (SIGCSE 2008), pages 116–120, Portland, Oregon, 2008.

DOI: 10.1145/1352135.1352177 71

[84] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM,

46(5):604–632, 1999. DOI: 10.1145/324133.324140 65, 95

[85] Philipp Koehn. Statistical Machine Translation. Cambridge University Press, Cambridge,

England, 2010. 130, 133

[86] Philipp Koehn, Franz J. Och, and Daniel Marcu. Statistical phrase-based translation. In

Proceedings of the 2003 Human Language Technology Conference of the North American Chapter

of the Association for Computational Linguistics (HLT/NAACL 2003), pages 48–54, Edmonton,

Alberta, Canada, 2003. DOI: 10.3115/1073445.1073462 131

[87] John D. Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic

models for segmenting and labeling sequence data. In Proceedings of the Eighteenth

International Conference on Machine Learning (ICML ’01), pages 282–289, San Francisco,

California, 2001. 138

[88] Ronny Lempel and Shlomo Moran. SALSA: The Stochastic Approach for Link-

Structure Analysis. ACM Transactions on Information Systems, 19(2):131–160, 2001.

DOI: 10.1145/382979.383041 65, 95

[89] Abby Levenberg, Chris Callison-Burch, and Miles Osborne. Stream-based translation mod-

els for statistical machine translation. In Proceedings of the 11th Annual Conference of the North

American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), Los

Angeles, California, 2010. 145

[90] Abby Levenberg and Miles Osborne. Stream-based randomised language models for SMT.

In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing,

pages 756–764, Singapore, 2009. 145

[91] Adam Leventhal. Triple-parity RAID and beyond. ACM Queue, 7(11), 2009.

DOI: 10.1145/1661785.1670144 2

[92] Jimmy Lin. An exploration of the principles underlying redundancy-based factoid

question answering. ACM Transactions on Information Systems, 27(2):1–55, 2007.

DOI: 10.1145/1229179.1229180 4

[93] Jimmy Lin. Exploring large-data issues in the curriculum: A case study with MapReduce. In

Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics (TeachCL-08)

at ACL 2008, pages 54–61, Columbus, Ohio, 2008. 71

[94] Jimmy Lin. Scalable language processing algorithms for the masses: A case study in computing

word co-occurrence matrices with MapReduce. In Proceedings of the 2008 Conference on

Empirical Methods in Natural Language Processing (EMNLP 2008), pages 419–428, Honolulu,

Hawaii, 2008. 47, 51

[95] Jimmy Lin, Anand Bahety, Shravya Konda, and Samantha Mahindrakar. Low-latency, high-

throughput access to static global resources within the Hadoop framework. Technical Report

HCIL-2009-01, University of Maryland, College Park, Maryland, January 2009. 63

[96] Dong C. Liu and Jorge Nocedal. On the limited memory

BFGS method for large scale optimization. Mathematical Programming B, 45(3):503–528,

1989. DOI: 10.1007/BF01589116 139

[97] Adam Lopez. Statistical machine translation. ACM Computing Surveys, 40(3):1–49, 2008.

DOI: 10.1145/1380584.1380586 130

[98] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn,

Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In

Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (PODC 2009),

page 6, Calgary, Alberta, Canada, 2009. DOI: 10.1145/1583991.1584010 86, 145

[99] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn,

Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing.

In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data,

Indianapolis, Indiana, 2010. DOI: 10.1145/1582716.1582723 86, 145

[100] Robert Malouf. A comparison of algorithms for maximum entropy parameter estimation.

In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), pages

49–55, Taipei, Taiwan, 2002. DOI: 10.3115/1118853.1118871 140

[101] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. An Introduction to

Information Retrieval. Cambridge University Press, Cambridge, England, 2008. 42, 83

[102] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language

Processing. MIT Press, Cambridge, Massachusetts, 1999. 5, 133

[103] Elaine R. Mardis. The impact of next-generation sequencing technology on genetics. Trends

in Genetics, 24(3):133–141, 2008. DOI: 10.1016/j.tig.2007.12.007 2

[104] Michael D. McCool. Scalable programming models for massively multicore processors.

Proceedings of the IEEE, 96(5):816–831, 2008. DOI: 10.1109/JPROC.2008.917731 13, 145

[105] Marshall K. McKusick and Sean Quinlan. GFS: Evolution on fast-forward. ACM Queue,

7(7), 2009. DOI: 10.1145/1594204.1594206 32

[106] John Mellor-Crummey, David Whalley, and Ken Kennedy. Improving memory hierarchy

performance for irregular applications using data and computation reorderings. International

Journal of Parallel Programming, 29(3):217–247, 2001. DOI: 10.1023/A:1011119519789 101

[107] Donald Metzler, Jasmine Novak, Hang Cui, and Srihari Reddy. Building enriched document

representations using aggregated anchor text. In Proceedings of the 32nd Annual International

ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009),

pages 219–226, 2009. DOI: 10.1145/1571941.1571981 68, 88

[108] David R. H. Miller, Tim Leek, and Richard M. Schwartz. A hidden Markov model informa-

tion retrieval system. In Proceedings of the 22nd Annual International ACM SIGIR Conference

on Research and Development in Information Retrieval (SIGIR 1999), pages 214–221, Berkeley,

California, 1999. DOI: 10.1145/312624.312680 114

[109] Alistair Moffat, William Webber, and Justin Zobel. Load balancing for term-distributed

parallel retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on

Research and Development in Information Retrieval (SIGIR 2006), pages 348–355, Seattle,

Washington, 2006. DOI: 10.1145/1148170.1148232 82

[110] Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy for text

classiﬁcation. In Proceedings of the IJCAI-99 Workshop on Machine Learning for Information

Filtering, pages 61–67, Stockholm, Sweden, 1999. 138

[111] Daniel Nurmi, Rich Wolski, Chris Grzegorczyk, Graziano Obertelli, Sunil Soman, Lamia

Youseff, and Dmitrii Zagorodnov. The Eucalyptus open-source cloud-computing system. In

Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid,

pages 124–131, Washington, D.C., 2009. DOI: 10.1109/CCGRID.2009.93 7

[112] Franz J. Och and Hermann Ney. A systematic comparison of various statistical alignment

models. Computational Linguistics, 29(1):19–51, 2003. DOI: 10.1162/089120103321337421

135

[113] Christopher Olston and Marc Najork. Web crawling. Foundations and Trends in Information

Retrieval, 4(3):175–246, 2010. DOI: 10.1561/1500000017 67

[114] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins.

Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM

SIGMOD International Conference on Management of Data, pages 1099–1110, Vancouver,

British Columbia, Canada, 2008. DOI: 10.1145/1376616.1376726 59, 146

[115] Kunle Olukotun and Lance Hammond. The future of microprocessors. ACM Queue, 3(7):27–

34, 2005. DOI: 10.1145/1095408.1095418 13

[116] Sean Owen and Robin Anil. Mahout in Action. Manning Publications Co., Greenwich,

Connecticut, 2010. 141

[117] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation

ranking: Bringing order to the Web. Stanford Digital Library Working Paper SIDL-WP-

1999-0120, Stanford University, 1999. 65, 95, 100

[118] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends

in Information Retrieval, 2(1–2):1–135, 2008. DOI: 10.1561/1500000011 3

[119] David A. Patterson. The data center is the computer. Communications of the ACM, 52(1):105,

2008. 14

[120] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel

Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis.

In Proceedings of the 35th ACM SIGMOD International Conference on Management of Data,

pages 165–178, Providence, Rhode Island, 2009. DOI: 10.1145/1559845.1559865 59

[121] Sasa Petrovic, Miles Osborne, and Victor Lavrenko. Streaming ﬁrst story detection with

application to Twitter. In Proceedings of the 11th Annual Conference of the North American

Chapter of the Association for Computational Linguistics (NAACL HLT 2010), Los Angeles,

California, 2010. 145

[122] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel

analysis with Sawzall. Scientiﬁc Programming Journal, 13(4):277–298, 2005. 146

[123] Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. Failure trends in a large

disk drive population. In Proceedings of the 5th USENIX Conference on File and Storage

Technologies (FAST 2007), San Jose, California, 2007. 10, 26

[124] Xiaoguang Qi and Brian D. Davison. Web page classiﬁcation: Features and algorithms. ACM

Computing Surveys, 41(2), 2009. DOI: 10.1145/1459352.1459357 82

[125] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in

speech recognition. In Readings in Speech Recognition, pages 267–296. Morgan Kaufmann

Publishers, San Francisco, California, 1990. DOI: 10.1109/5.18626 115, 120

[126] M. Mustafa Raﬁque, Benjamin Rose, Ali R. Butt, and Dimitrios S. Nikolopoulos. Supporting

MapReduce on large-scale asymmetric multi-core clusters. ACM Operating Systems Review,

43(2):25–34, 2009. DOI: 10.1145/1531793.1531800 20

[127] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos

Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings

of the 13th International Symposium on High-Performance Computer Architecture (HPCA 2007),

pages 205–218, Phoenix, Arizona, 2007. DOI: 10.1109/HPCA.2007.346181 20

[128] Delip Rao and David Yarowsky. Ranking and semi-supervised classiﬁcation on large scale

graphs using Map-Reduce. In Proceedings of the ACL/IJCNLP 2009 Workshop on Graph-Based

Methods for Natural Language Processing (TextGraphs-4), Singapore, 2009. 103

[129] Michael A. Rappa. The utility business model and the future of computing services. IBM

Systems Journal, 43(1):32–42, 2004. DOI: 10.1147/sj.431.0032 6

[130] Sheldon M. Ross. Stochastic processes. Wiley, New York, 1996. 114

[131] Thomas Sandholm and Kevin Lai. MapReduce optimization using regulated dynamic pri-

oritization. In Proceedings of the Eleventh International Joint Conference on Measurement and

Modeling of Computer Systems (SIGMETRICS’09), pages 299–310, Seattle, Washington, 2009.

DOI: 10.1145/1555349.1555384 25

[132] Michael Schatz. High Performance Computing for DNA Sequence Alignment and Assembly. PhD

thesis, University of Maryland, College Park, 2010. 103

[133] Frank Schmuck and Roger Haskin. GPFS: A shared-disk ﬁle system for large computing

clusters. In Proceedings of the First USENIX Conference on File and Storage Technologies, pages

231–244, Monterey, California, 2002. 29

[134] Donovan A. Schneider and David J. DeWitt. A performance evaluation of four parallel join

algorithms in a shared-nothing multiprocessor environment. In Proceedings of the 1989 ACM

SIGMOD International Conference on Management of Data, pages 110–121, Portland, Oregon,

1989. DOI: 10.1145/67544.66937 60

[135] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM errors in the wild:

A large-scale ﬁeld study. In Proceedings of the Eleventh International Joint Conference on

Measurement and Modeling of Computer Systems (SIGMETRICS ’09), pages 193–204, Seattle,

Washington, 2009. DOI: 10.1145/1555349.1555372 10, 26

[136] Hinrich Schütze. Automatic word sense discrimination. Computational Linguistics, 24(1):97–

123, 1998. 48

[137] Hinrich Schütze and Jan O. Pedersen. A cooccurrence-based thesaurus and two applica-

tions to information retrieval. Information Processing and Management, 33(3):307–318, 1998.

DOI: 10.1016/S0306-4573(96)00068-4 48

[138] Satoshi Sekine and Elisabete Ranchhod. Named Entities: Recognition, Classiﬁcation and Use.

John Benjamins, Amsterdam, The Netherlands, 2009. 3

[139] Kristie Seymore, Andrew McCallum, and Ronald Rosenfeld. Learning hidden Markov model

structure for information extraction. In Proceedings of the AAAI-99 Workshop on Machine

Learning for Information Extraction, pages 37–42, Orlando, Florida, 1999. 114

[140] Fei Sha and Fernando Pereira. Shallow parsing with conditional random ﬁelds. In Proceed-

ings of the 2003 Human Language Technology Conference of the North American Chapter of the

Association for Computational Linguistics (HLT/NAACL 2003), pages 134–141, Edmonton,

Alberta, Canada, 2003. DOI: 10.3115/1073445.1073473 140

[141] Noah Smith. Log-linear models. http://www.cs.cmu.edu/~nasmith/papers/smith.tut04.pdf, 2004. 140

[142] Christopher Southan and Graham Cameron. Beyond the tsunami: Developing the infras-

tructure to deal with life sciences data. In Tony Hey, Stewart Tansley, and Kristin Tolle,

editors, The Fourth Paradigm: Data-Intensive Scientiﬁc Discovery. Microsoft Research, Red-

mond, Washington, 2009. 2

[143] Mario Stanke and Stephan Waack. Gene prediction with a hidden Markov model

and a new intron submodel. Bioinformatics, 19 Suppl 2:ii215–225, October 2003.

DOI: 10.1093/bioinformatics/btg1080 114

[144] Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew

Pavlo, and Alexander Rasin. MapReduce and parallel DBMSs: Friends or foes? Communi-

cations of the ACM, 53(1):64–71, 2010. DOI: 10.1145/1629175.1629197 59

[145] Alexander S. Szalay, Peter Z. Kunszt, Ani Thakar, Jim Gray, Don Slutz, and Robert J. Brunner.

Designing and mining multi-terabyte astronomy archives: The Sloan Digital Sky Survey.

SIGMOD Record, 29(2):451–462, 2000. DOI: 10.1145/335191.335439 2

[146] Wittawat Tantisiriroj, Swapnil Patil, and Garth Gibson. Data-intensive ﬁle systems for

Internet services: A rose by any other name…. Technical Report CMU-PDL-08-114, Parallel

Data Laboratory, Carnegie Mellon University, 2008. 29

[147] Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee. Frangipani: A scalable

distributed file system. In Proceedings of the 16th ACM Symposium on Operating Systems

Principles (SOSP 1997), pages 224–237, Saint-Malo, France, 1997. DOI: 10.1145/268998.266694

29

[148] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM,

33(8):103–111, 1990. DOI: 10.1145/79173.79181 13, 14, 15, 86, 145

[149] Luis M. Vaquero, Luis Rodero-Merino, Juan Caceres, and Maik Lindner. A break in the

clouds: Towards a cloud deﬁnition. ACM SIGCOMM Computer Communication Review,

39(1):50–55, 2009. DOI: 10.1145/1496091.1496100 6

[150] Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-based word alignment

in statistical translation. In Proceedings of the 16th International Conference on Com-

putational Linguistics (COLING 1996), pages 836–841, Copenhagen, Denmark, 1996.

DOI: 10.3115/993268.993313 114, 135

[151] Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: Parallel

latent Dirichlet allocation for large-scale applications. In Proceedings of the Fifth International

Conference on Algorithmic Aspects in Information and Management (AAIM 2009), pages 301–

314, San Francisco, California, 2009. DOI: 10.1007/978-3-642-02158-9_26 141

[152] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks.

Nature, 393:440–442, 1998. DOI: 10.1038/30918 92

[153] Xingzhi Wen and Uzi Vishkin. FPGA-based prototype of a PRAM-On-Chip processor.

In Proceedings of the 5th Conference on Computing Frontiers, pages 55–66, Ischia, Italy, 2008.

DOI: 10.1145/1366230.1366240 14

[154] Tom White. Hadoop: The Deﬁnitive Guide. O’Reilly, Sebastopol, California, 2009. 15, 23

[155] Eugene Wigner. The unreasonable effectiveness of mathematics in the natural

sciences. Communications on Pure and Applied Mathematics, 13(1):1–14, 1960.

DOI: 10.1002/cpa.3160130102 5

[156] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and

Indexing Documents and Images. Morgan Kaufmann Publishing, San Francisco, California,

1999. DOI: 10.1023/A:1011472308196 69, 77, 78, 83

[157] Jinxi Xu and W. Bruce Croft. Corpus-based stemming using cooccurrence of word variants.

ACM Transactions on Information Systems, 16(1):61–81, 1998. DOI: 10.1145/267954.267957

48

[158] Rui Xu and Donald Wunsch II. Survey of clustering algorithms. IEEE Transactions on Neural

Networks, 16(3):645–678, 2005. DOI: 10.1109/TNN.2005.845141 86

[159] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar

Gunda, and Jon Currey. DryadLINQ: A system for general-purpose distributed data-parallel

computing using a high-level language. In Proceedings of the 8th Symposium on Operating

System Design and Implementation (OSDI 2008), pages 1–14, San Diego, California, 2008.

145

[160] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker,

and Ion Stoica. Job scheduling for multi-user MapReduce clusters. Technical Report

UCB/EECS-2009-55, Electrical Engineering and Computer Sciences, University of Cal-

ifornia at Berkeley, 2009. 25

[161] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. Improving

MapReduce performance in heterogeneous environments. In Proceedings of the 8th Sympo-

sium on Operating System Design and Implementation (OSDI 2008), pages 29–42, San Diego,

California, 2008. 25

[162] Justin Zobel and Alistair Moffat. Inverted ﬁles for text search engines. ACM Computing

Surveys, 38(6):1–56, 2006. 78, 83

Authors’ Biographies

JIMMY LIN

Jimmy Lin is an Associate Professor in the iSchool (College of Information Studies) at the Uni-

versity of Maryland, College Park. He directs the recently-formed Cloud Computing Center, an

interdisciplinary group that explores the many aspects of cloud computing as it impacts technol-

ogy, people, and society. Lin’s research lies at the intersection of natural language processing and

information retrieval, with a recent emphasis on scalable algorithms and large-data processing. He

received his Ph.D. from MIT in Electrical Engineering and Computer Science in 2004.

CHRIS DYER

Chris Dyer is graduating with a Ph.D. in Linguistics from the University of Maryland, College

Park in June, 2010 and will be joining the Language Technologies Institute at Carnegie Mellon

University as a postdoctoral researcher. His research interests include statistical machine translation

and machine learning, and he has served as a reviewer for numerous conferences and journals in the

areas of natural language processing and computational linguistics. He ﬁrst became acquainted with

MapReduce in 2007 using Hadoop, version 0.13.0, and gained further experience with MapReduce

during an internship with Google Research in 2008.

Data-Intensive Text Processing with MapReduce

Jimmy Lin and Chris Dyer
University of Maryland

SYNTHESIS LECTURES ON HUMAN LANGUAGE TECHNOLOGIES #7

Morgan & Claypool Publishers

ABSTRACT

Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader “think in MapReduce”, but also discusses limitations of the programming model.

KEYWORDS

Hadoop, parallel and distributed programming, algorithm design, text processing, natural language processing, information retrieval, machine learning

Contents

Acknowledgments

1 Introduction
    1.1 Computing in the Clouds
    1.2 Big Ideas
    1.3 Why Is This Different?
    1.4 What This Book Is Not

2 MapReduce Basics
    2.1 Functional Programming Roots
    2.2 Mappers and Reducers
    2.3 The Execution Framework
    2.4 Partitioners and Combiners
    2.5 The Distributed File System
    2.6 Hadoop Cluster Architecture
    2.7 Summary

3 MapReduce Algorithm Design
    3.1 Local Aggregation
        3.1.1 Combiners and In-Mapper Combining
        3.1.2 Algorithmic Correctness with Local Aggregation
    3.2 Pairs and Stripes
    3.3 Computing Relative Frequencies
    3.4 Secondary Sorting
    3.5 Relational Joins
        3.5.1 Reduce-Side Join
        3.5.2 Map-Side Join
        3.5.3 Memory-Backed Join
    3.6 Summary

4 Inverted Indexing for Text Retrieval
    4.1 Web Crawling
    4.2 Inverted Indexes
    4.3 Inverted Indexing: Baseline Implementation
    4.4 Inverted Indexing: Revised Implementation
    4.5 Index Compression
        4.5.1 Byte-Aligned and Word-Aligned Codes
        4.5.2 Bit-Aligned Codes
        4.5.3 Postings Compression
    4.6 What About Retrieval?
    4.7 Summary and Additional Readings

5 Graph Algorithms
    5.1 Graph Representations
    5.2 Parallel Breadth-First Search
    5.3 PageRank
    5.4 Issues with Graph Processing
    5.5 Summary and Additional Readings

6 EM Algorithms for Text Processing
    6.1 Expectation Maximization
        6.1.1 Maximum Likelihood Estimation
        6.1.2 A Latent Variable Marble Game
        6.1.3 MLE with Latent Variables
        6.1.4 Expectation Maximization
        6.1.5 An EM Example
    6.2 Hidden Markov Models
        6.2.1 Three Questions for Hidden Markov Models
        6.2.2 The Forward Algorithm
        6.2.3 The Viterbi Algorithm
        6.2.4 Parameter Estimation for HMMs
        6.2.5 Forward-Backward Training: Summary
    6.3 EM in MapReduce
        6.3.1 HMM Training in MapReduce
    6.4 Case Study: Word Alignment for Statistical Machine Translation
        6.4.1 Statistical Phrase-Based Translation
        6.4.2 Brief Digression: Language Modeling with MapReduce
        6.4.3 Word Alignment
        6.4.4 Experiments
    6.5 EM-Like Algorithms
        6.5.1 Gradient-Based Optimization and Log-Linear Models
    6.6 Summary and Additional Readings

7 Closing Remarks
    7.1 Limitations of MapReduce
    7.2 Alternative Computing Paradigms
    7.3 MapReduce and Beyond

Bibliography

Authors’ Biographies

Acknowledgments

The first author is grateful to Esther and Kiri for their loving support. He dedicates this book to Joshua and Jacob, the new joys of his life.

The second author would like to thank Herb for putting up with his disorderly living habits and Philip for being a very indulgent linguistics advisor.

This work was made possible by the Google and IBM Academic Cloud Computing Initiative (ACCI) and the National Science Foundation’s Cluster Exploratory (CLuE) program, under award IIS-0836560, and also award IIS-0916043. Any opinions, findings, conclusions, or recommendations expressed in this book are those of the authors and do not necessarily reflect the views of the sponsors.

We are grateful to Jeff Dean, Miles Osborne, Tom White, as well as numerous other individuals who have commented on earlier drafts of this book.

Jimmy Lin and Chris Dyer
May 2010

CHAPTER 1

Introduction

MapReduce [45] is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers. It was originally developed by Google and built on well-known principles in parallel and distributed processing dating back several decades. MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo and which is now an Apache project. Today, a vibrant software ecosystem has sprung up around Hadoop, with significant activity in both industry and academia.

This book is about scalable approaches to processing large amounts of text with MapReduce. Given this focus, it makes sense to start with the most basic question: Why? There are many answers to this question, but we focus on two. First, “big data” is a fact of the world, and therefore an issue that real-world systems must grapple with. Second, across a wide range of text processing applications, more data translates into more effective algorithms, and thus it makes sense to take advantage of the plentiful amounts of data that surround us.

Modern information societies are defined by vast repositories of data, both public and private. Therefore, any practical application must be able to scale up to datasets of interest. For many, this means scaling up to the web, or at least a non-trivial fraction thereof. Any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing web content must tackle large-data problems: “web-scale” processing is practically synonymous with data-intensive processing. This observation applies not only to well-established internet companies, but also to countless startups and niche players. Just think, how many companies do you know that start their pitch with “we’re going to harvest information on the web and…”?

Another strong area of growth is the analysis of user behavior data. Any operator of a moderately successful website can record user activity and in a matter of weeks (or sooner) be drowning in a torrent of log data. In fact, logging user behavior generates so much data that many organizations simply can’t cope with the volume, and either turn the functionality off or throw away data after some time. This represents lost opportunities, as there is a broadly held belief that great value lies in insights derived from mining such data. Knowing what users look at, what they click on, how much time they spend on a web page, etc., leads to better business decisions and competitive advantages. Broadly, this is known as business intelligence, which encompasses a wide range of technologies including data warehousing, data mining, and analytics.

How much data are we talking about? A few examples: Google grew from processing 100 terabytes of data a day with MapReduce in 2004 [45] to processing 20 petabytes a day with MapReduce in 2008 [46]. In April 2009, a blog post¹ was written about eBay’s two enormous data warehouses: one with 2 petabytes of user data, and the other with 6.5 petabytes of user data spanning 170 trillion records and growing by 150 billion new records per day. Shortly thereafter, Facebook revealed² similarly impressive numbers, boasting of 2.5 petabytes of user data, growing at about 15 terabytes per day.

Petabyte datasets are rapidly becoming the norm, and the trends are clear: our ability to store data is fast overwhelming our ability to process what we store. More distressing, increases in capacity are outpacing improvements in bandwidth such that our ability to even read back what we store is deteriorating [91]. Disk capacities have grown from tens of megabytes in the mid-1980s to about a couple of terabytes today (several orders of magnitude). On the other hand, latency and bandwidth have improved relatively little: in the case of latency, perhaps 2× improvement during the last quarter century, and in the case of bandwidth, perhaps 50×. Given the tendency for individuals and organizations to continuously fill up whatever capacity is available, large-data problems are growing increasingly severe.

Moving beyond the commercial sphere, many have recognized the importance of data management in many scientific disciplines, where petabyte-scale datasets are also becoming increasingly common [21]. For example:

• The high-energy physics community was already describing experiences with petabyte-scale databases back in 2005 [20]. Today, the Large Hadron Collider (LHC) near Geneva is the world’s largest particle accelerator, designed to probe the mysteries of the universe, including the fundamental nature of matter, by recreating conditions shortly following the Big Bang. When it becomes fully operational, the LHC will produce roughly 15 petabytes of data a year.³

• Astronomers have long recognized the importance of a “digital observatory” that would support the data needs of researchers across the globe; the Sloan Digital Sky Survey [145] is perhaps the best known of these projects. Looking into the future, the Large Synoptic Survey Telescope (LSST) is a wide-field instrument that is capable of observing the entire sky every few days. When the telescope comes online around 2015 in Chile, its 3.2 gigapixel primary camera will produce approximately half a petabyte of archive images every month [19].

• The advent of next-generation DNA sequencing technology has created a deluge of sequence data that needs to be stored, organized, and delivered to scientists for further study. Given the fundamental tenet in modern genetics that genotypes explain phenotypes, the impact of this technology is nothing less than transformative [103]. The European Bioinformatics Institute (EBI), which hosts a central repository of sequence data called EMBL-bank, has increased storage capacity from 2.5 petabytes in 2008 to 5 petabytes in 2009 [142]. Scientists are predicting that, in the not-so-distant future, sequencing an individual’s genome will be no more complex than getting a blood test today, ushering in a new era of personalized medicine, where interventions can be specifically targeted for an individual.
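
A rough worked example may help make the capacity-versus-bandwidth gap mentioned earlier concrete. The sketch below assumes a mid-1980s disk of 20 MB read at 2 MB/s and a present-day disk of 2 TB read 50× faster. These specific figures are illustrative assumptions picked to match the approximate ratios quoted in the text ("tens of megabytes", "a couple of terabytes", "perhaps 50×"); they are not measurements taken from the text or from [91].

```python
# Back-of-the-envelope sketch: how long does it take to read an entire disk?
# All concrete figures are illustrative assumptions, not measured values.

def full_scan_seconds(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read a whole disk sequentially at a sustained bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s

MB = 10**6
TB = 10**12

# Assumed mid-1980s disk: 20 MB capacity, 2 MB/s sustained transfer.
old_capacity, old_bandwidth = 20 * MB, 2 * MB
# Assumed present-day disk: 2 TB capacity, bandwidth 50x that of the old disk.
new_capacity, new_bandwidth = 2 * TB, 50 * old_bandwidth

old_scan = full_scan_seconds(old_capacity, old_bandwidth)
new_scan = full_scan_seconds(new_capacity, new_bandwidth)

capacity_growth = new_capacity / old_capacity     # how much more we can store
bandwidth_growth = new_bandwidth / old_bandwidth  # how much faster we can read
slowdown = new_scan / old_scan                    # how much longer a full scan takes

print(f"capacity grew {capacity_growth:,.0f}x, bandwidth grew {bandwidth_growth:,.0f}x")
print(f"full scan: {old_scan:.0f} s then, {new_scan / 3600:.1f} h now "
      f"({slowdown:,.0f}x longer)")
```

Under these assumptions, capacity grows 100,000× while bandwidth grows only 50×, so a full sequential scan takes about 2,000 times longer (roughly 10 seconds versus 5.6 hours). This is the sense in which our ability to read back what we store is deteriorating, whatever the exact figures for a particular drive.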

1 http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
2 http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/
3 http://public.web.cern.ch/public/en/LHC/Computing-en.html

Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate, explore, and mine massive datasets [72]—this has been hailed as the emerging “fourth paradigm” of science [73] (complementing theory, experiments, and simulations). In other areas of academia, particularly computer science, systems and algorithms incapable of scaling to massive real-world datasets run the danger of being dismissed as “toy systems” with limited utility. Large data is a fact of today’s world and data-intensive processing is fast becoming a necessity, not merely a luxury or curiosity.

Although large data comes in a variety of forms, this book is primarily concerned with processing large amounts of text, but touches on other types of data as well (e.g., relational and graph data). The problems and solutions we discuss mostly fall into the disciplinary boundaries of natural language processing (NLP) and information retrieval (IR). Recent work in these fields is dominated by a data-driven, empirical approach, typically involving algorithms that attempt to capture statistical regularities in data for the purposes of some task or application. There are three components to this approach: data, representations of the data, and some method for capturing regularities in the data. Data are called corpora (singular, corpus) by NLP researchers and collections by those from the IR community. Aspects of the representations of the data are called features, which may be “superficial” and easy to extract, such as the words and sequences of words themselves, or “deep” and more difficult to extract, such as the grammatical relationship between words. Finally, algorithms or models are applied to capture regularities in the data in terms of the extracted features for some application. One common application, classification, is to sort text into categories. Examples include: Is this email spam or not spam? Is this word part of an address or a location? The first task is easy to understand, while the second task is an instance of what NLP researchers call named-entity detection [138], which is useful for local search and pinpointing locations on maps. Another common application is to rank texts according to some criteria—search is a good example, which involves ranking documents by relevance to the user’s query. Another example is to automatically situate texts along a scale of “happiness”, a task known as sentiment analysis or opinion mining [118], which has been applied to everything from understanding political discourse in the blogosphere to predicting the movement of stock prices.

There is a growing body of evidence, at least in text processing, that of the three components discussed above (data, features, algorithms), data probably matters the most. Superficial word-level features coupled with simple models in most cases trump sophisticated models with deeper features and less data. But why can’t we have our cake and eat it too? Why not both sophisticated models and deep features applied to lots of data? Because inference over sophisticated models and extraction of deep features are often computationally intensive, they don’t scale well.

Consider a simple task such as determining the correct usage of easily confusable words such as “than” and “then” in English. One can view this as a supervised machine learning problem: we can train a classifier to disambiguate between the options, and then apply the classifier to new instances of the problem (say, as part of a grammar checker). Training data is fairly easy to come by—we can just gather a large corpus of texts and assume that most writers make correct choices

(the training data may be noisy, since people make mistakes, but no matter). In 2001, Banko and Brill [14] published what has become a classic paper in natural language processing exploring the effects of training data size on classification accuracy, using this task as the specific example. They explored several classification algorithms (the exact ones aren’t important, as we shall see), and not surprisingly, found that more data led to better accuracy. Across many different algorithms, the increase in accuracy was approximately linear in the log of the size of the training data. Furthermore, with increasing amounts of training data, the accuracy of different algorithms converged, such that pronounced differences in effectiveness observed on smaller datasets basically disappeared at scale. This led to a somewhat controversial conclusion (at least at the time): machine learning algorithms really don’t matter, all that matters is the amount of data you have. This resulted in an even more controversial recommendation, delivered somewhat tongue-in-cheek: we should just give up working on algorithms and simply spend our time gathering data (while waiting for computers to become faster so we can process the data).

As another example, consider the problem of answering short, fact-based questions such as “Who shot Abraham Lincoln?” Instead of returning a list of documents that the user would then have to sort through, a question answering (QA) system would directly return the answer: John Wilkes Booth. This problem gained interest in the late 1990s, when natural language processing researchers approached the challenge with sophisticated linguistic processing techniques such as syntactic and semantic analysis. Around 2001, researchers discovered a far simpler approach to answering such questions based on pattern matching [27; 53; 92]. Suppose you wanted the answer to the above question. As it turns out, you can simply search for the phrase “shot Abraham Lincoln” on the web and look for what appears to its left. Or better yet, look through multiple instances of this phrase and tally up the words that appear to the left. This simple strategy works surprisingly well, and has become known as the redundancy-based approach to question answering. It capitalizes on the insight that in a very large text collection (i.e., the web), answers to commonly asked questions will be stated in obvious ways, such that pattern-matching techniques suffice to extract answers accurately.

Yet another example concerns smoothing in web-scale language models [25]. A language model is a probability distribution that characterizes the likelihood of observing a particular sequence of words, estimated from a large corpus of texts. They are useful in a variety of applications, such as speech recognition (to determine what the speaker is more likely to have said) and machine translation (to determine which of possible translations is the most fluent, as we will discuss in Section 6.4). Since there are infinitely many possible strings, and probabilities must be assigned to all of them, language modeling is a more challenging task than simply keeping track of which strings were seen how many times: some number of likely strings will never be encountered, even with lots and lots of training data! Most modern language models make the Markov assumption: in an n-gram language model, the conditional probability of a word is given by the n − 1 previous words. Thus, by the chain rule, the probability of a sequence of words can be decomposed into the product of n-gram probabilities. Nevertheless, an enormous number of parameters must still be estimated from a training corpus: potentially V^n parameters, where V is the number of words in the vocabulary. Even if we treat

every word on the web as the training corpus from which to estimate the n-gram probabilities, most n-grams—in any language, even English—will never have been seen. To cope with this sparseness, researchers have developed a number of smoothing techniques [35; 79; 102], which all share the basic idea of moving probability mass from observed to unseen events in a principled manner. Smoothing approaches vary in effectiveness, both in terms of intrinsic and application-specific metrics. In 2007, Brants et al. [25] described language models trained on up to two trillion words.4 Their experiments compared a state-of-the-art approach known as Kneser-Ney smoothing [35] with another technique the authors affectionately referred to as “stupid backoff”.5 Not surprisingly, stupid backoff didn’t work as well as Kneser-Ney smoothing on smaller corpora. However, it was simpler and could be trained on more data, which ultimately yielded better language models. That is, a simpler technique on more data beat a more sophisticated technique on less data.

Recently, three Google researchers summarized this data-driven philosophy in an essay titled The Unreasonable Effectiveness of Data [65].6 Why is this so? It boils down to the fact that language in the wild, just like human behavior in general, is messy. Unlike, say, the interaction of subatomic particles, human use of language is not constrained by succinct, universal “laws of grammar”. There are of course rules that govern the formation of words and sentences—for example, that verbs appear before objects in English, and that subjects and verbs must agree in number in many languages—but real-world language is affected by a multitude of other factors as well: people invent new words and phrases all the time, authors occasionally make mistakes, groups of individuals write within a shared context, etc. The Argentine writer Jorge Luis Borges wrote a famous allegorical one-paragraph story about a fictional society in which the art of cartography had gotten so advanced that their maps were as big as the lands they were describing.7 The world, he would say, is the best description of itself. In the same way, the more observations we gather about language use, the more accurate a description we have of language itself. This, in turn, translates into more effective algorithms and systems.

So, in summary, why large data? In some ways, the first answer is similar to the reason people climb mountains: because they’re there. But the second answer is even more compelling. Data represent the rising tide that lifts all boats—more data lead to better algorithms and systems for solving real-world problems. Now that we’ve addressed the why, let’s tackle the how. Let’s start with the obvious observation: data-intensive processing is beyond the capability of any individual machine and requires clusters—which means that large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines. This is exactly what MapReduce does, and the rest of this book is about the how.

4 As an aside, it is interesting to observe the evolving definition of large over the years. Banko and Brill’s paper in 2001 was titled Scaling to Very Very Large Corpora for Natural Language Disambiguation, and dealt with a corpus containing a billion words.
5 As in, so stupid it couldn’t possibly work.
6 This title was inspired by a classic article titled The Unreasonable Effectiveness of Mathematics in the Natural Sciences [155]. This is somewhat ironic in that the original article lauded the beauty and elegance of mathematical models in capturing natural phenomena, which is the exact opposite of the data-driven approach.
7 On Exactitude in Science [23]. A similar exchange appears in Chapter XI of Sylvie and Bruno Concluded by Lewis Carroll (1893).
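To make the language-modeling example in this chapter concrete, the following is a minimal sketch of the Markov assumption with n = 2 (a bigram model), using unsmoothed maximum-likelihood counts on a toy corpus. The function names, the toy corpus, and the vocabulary size of 100,000 are illustrative assumptions, not taken from Brants et al. or from any particular library. The sketch shows the two points made in the text: sequence probabilities decompose into a product of n-gram probabilities, and a plausible but unseen n-gram drives the whole product to zero unless some form of smoothing is applied.

```python
from collections import Counter


def train_bigram_counts(tokens):
    """Count contexts (every token that has a successor) and bigrams."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return contexts, bigrams


def bigram_prob(contexts, bigrams, prev, word):
    """Unsmoothed maximum-likelihood estimate of P(word | prev)."""
    if contexts[prev] == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return bigrams[(prev, word)] / contexts[prev]


def sequence_prob(contexts, bigrams, tokens):
    """Chain rule under the Markov assumption (n = 2).

    Returns the probability of tokens[1:] given tokens[0]; the
    probability of the first word is deliberately left out.
    """
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(contexts, bigrams, prev, word)
    return prob


def max_parameters(vocab_size, n):
    """Upper bound on n-gram parameters: V to the power n."""
    return vocab_size ** n


# Toy corpus (hypothetical); real models are trained on billions of words.
corpus = "the cat sat on the mat".split()
contexts, bigrams = train_bigram_counts(corpus)

seen = sequence_prob(contexts, bigrams, "the cat sat".split())
unseen = sequence_prob(contexts, bigrams, "the mat sat".split())

print(seen)    # product of P(cat | the) and P(sat | cat)
print(unseen)  # zero, because "mat" never occurs as a context
print(max_parameters(100_000, 3))  # assumed V = 100,000 words, trigrams
```

The zero for the unseen sequence is the sparseness problem that smoothing techniques address by moving probability mass to unseen events; the last line illustrates why, even for trigrams, most of the potential parameters can never be observed in any realistic corpus.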


1.1 COMPUTING IN THE CLOUDS

For better or for worse, it is often difﬁcult to untangle MapReduce and large-data processing from the broader discourse on cloud computing. True, there is substantial promise in this new paradigm of computing, but unwarranted hype by the media and popular sources threatens its credibility in the long run. In some ways, cloud computing is simply brilliant marketing. Before clouds, there were grids,8 and before grids, there were vector supercomputers, each having claimed to be the best thing since sliced bread. So what exactly is cloud computing? This is one of those questions where 10 experts will give 11 different answers; in fact, countless papers have been written simply to attempt to deﬁne the term (e.g., [9; 31; 149], just to name a few examples). Here we offer up our own thoughts and attempt to explain how cloud computing relates to MapReduce and data-intensive processing. At the most superﬁcial level, everything that used to be called web applications has been rebranded to become “cloud applications”, which includes what we have previously called “Web 2.0” sites. In fact, anything running inside a browser that gathers and stores user-generated content now qualiﬁes as an example of cloud computing. This includes social-networking services such as Facebook, video-sharing sites such as YouTube, web-based email services such as Gmail, and applications such as Google Docs. In this context, the cloud simply refers to the servers that power these sites, and user data is said to reside “in the cloud”. The accumulation of vast quantities of user data creates large-data problems, many of which are suitable for MapReduce. To give two concrete examples: a social-networking site analyzes connections in the enormous globe-spanning graph of friendships to recommend new connections. An online email service analyzes messages and user behavior to optimize ad selection and placement. 
These are all large-data problems that have been tackled with MapReduce.9 Another important facet of cloud computing is what’s more precisely known as utility computing [31; 129]. As the name implies, the idea behind utility computing is to treat computing resources as a metered service, like electricity or natural gas. The idea harkens back to the days of time-sharing machines, and in truth isn’t very different from this antiquated form of computing. Under this model, a “cloud user” can dynamically provision any amount of computing resources from a “cloud provider” on demand and only pay for what is consumed. In practical terms, the user is paying for access to virtual machine instances that run a standard operating system such as Linux. Virtualization technology (e.g., [15]) is used by the cloud provider to allocate available physical

8 What is the difference between cloud computing and grid computing? Although both tackle the fundamental problem of how best to bring computational resources to bear on large and difficult problems, they start with different assumptions. Whereas clouds are assumed to be relatively homogeneous servers that reside in a datacenter or are distributed across a relatively small number of datacenters controlled by a single organization, grids are assumed to be a less tightly-coupled federation of heterogeneous resources under the control of distinct but cooperative organizations. As a result, grid computing tends to deal with tasks that are coarser-grained, and must deal with the practicalities of a federated environment, e.g., verifying credentials across multiple administrative domains. Grid computing has adopted a middleware-based approach for tackling many of these challenges.
9 The first example is Facebook, a well-known user of Hadoop, in exactly the manner as described [68]. The second is, of course, Google, which uses MapReduce to continuously improve existing algorithms and to devise new algorithms for ad selection and placement.


resources and enforce isolation between multiple users who may be sharing the same hardware. Once one or more virtual machine instances have been provisioned, the user has full control over the resources and can use them for arbitrary computation. Virtual machines that are no longer needed are destroyed, thereby freeing up physical resources that can be redirected to other users. Resource consumption is measured in some equivalent of machine-hours and users are charged in increments thereof. Both users and providers benefit in the utility computing model. Users are freed from upfront capital investments necessary to build datacenters and substantial recurring costs in maintaining them. They also gain the important property of elasticity—as demand for computing resources grows, for example, from an unpredicted spike in customers, more resources can be seamlessly allocated from the cloud without an interruption in service. As demand falls, provisioned resources can be released. Prior to the advent of utility computing, coping with unexpected spikes in demand was fraught with challenges: under-provision and run the risk of service interruptions, or over-provision and tie up precious capital in idle machines that are depreciating. From the utility provider point of view, this business also makes sense because large datacenters benefit from economies of scale and can be run more efficiently than smaller datacenters. In the same way that insurance works by aggregating risk and redistributing it, utility providers aggregate the computing demands for a large number of users. Although demand may fluctuate significantly for each user, overall trends in aggregate demand should be smooth and predictable, which allows the cloud provider to adjust capacity over time with less risk of either offering too much (resulting in inefficient use of capital) or too little (resulting in unsatisfied customers).
In the world of utility computing, Amazon Web Services currently leads the way and remains the dominant player, but a number of other cloud providers populate a market that is becoming increasingly crowded. Most systems are based on proprietary infrastructure, but there is at least one, Eucalyptus [111], that is available open source. Increased competition will beneﬁt cloud users, but what direct relevance does this have for MapReduce? The connection is quite simple: processing large amounts of data with MapReduce requires access to clusters with sufﬁcient capacity. However, not everyone with large-data problems can afford to purchase and maintain clusters. This is where utility computing comes in: clusters of sufﬁcient size can be provisioned only when the need arises, and users pay only as much as is required to solve their problems. This lowers the barrier to entry for data-intensive processing and makes MapReduce much more accessible. A generalization of the utility computing concept is “everything as a service”, which is itself a new take on the age-old idea of outsourcing. A cloud provider offering customers access to virtual machine instances is said to be offering infrastructure as a service, or IaaS for short. However, this may be too low level for many users. Enter platform as a service (PaaS), which is a rebranding of what used to be called hosted services in the “pre-cloud” era. Platform is used generically to refer to any set of well-deﬁned services on top of which users can build applications, deploy content, etc. This class of services is best exempliﬁed by Google App Engine, which provides the backend datastore and API for anyone to build highly scalable web applications. Google maintains the infrastructure,


freeing the user from having to back up, upgrade, patch, or otherwise maintain basic services such as the storage layer or the programming environment. At an even higher level, cloud providers can offer software as a service (SaaS), as exemplified by Salesforce, a leader in customer relationship management (CRM) software. Other examples include outsourcing an entire organization’s email to a third party, which is commonplace today. What does this proliferation of services have to do with MapReduce? No doubt that “everything as a service” is driven by desires for greater business efficiencies, but scale and elasticity play important roles as well. The cloud allows seamless expansion of operations without the need for careful planning and supports scales that may otherwise be difficult or cost-prohibitive for an organization to achieve. Cloud services, just like MapReduce, represent the search for an appropriate level of abstraction and beneficial divisions of labor. IaaS is an abstraction over raw physical hardware—an organization might lack the capital, expertise, or interest in running datacenters, and therefore pays a cloud provider to do so on its behalf. The argument applies similarly to PaaS and SaaS. In the same vein, the MapReduce programming model is a powerful abstraction that separates the what from the how of data-intensive processing.

1.2 BIG IDEAS

Tackling large-data problems requires a distinct approach that sometimes runs counter to traditional models of computing. In this section, we discuss a number of “big ideas” behind MapReduce. To be fair, all of these ideas have been discussed in the computer science literature for some time (some for decades), and MapReduce is certainly not the first to adopt these ideas. Nevertheless, the engineers at Google deserve tremendous credit for pulling these various threads together and demonstrating the power of these ideas on a scale previously unheard of.

Scale “out”, not “up”. For data-intensive workloads, a large number of commodity low-end servers (i.e., the scaling “out” approach) is preferred over a small number of high-end servers (i.e., the scaling “up” approach). The latter approach of purchasing symmetric multi-processing (SMP) machines with a large number of processor sockets (dozens, even hundreds) and a large amount of shared memory (hundreds or even thousands of gigabytes) is not cost effective, since the costs of such machines do not scale linearly (i.e., a machine with twice as many processors is often significantly more than twice as expensive). On the other hand, the low-end server market overlaps with the high-volume desktop computing market, which has the effect of keeping prices low due to competition, interchangeable components, and economies of scale. Barroso and Hölzle’s recent treatise on what they dubbed “warehouse-scale computers” [18] contains a thoughtful analysis of the two approaches. The Transaction Processing Performance Council (TPC) is a neutral, non-profit organization whose mission is to establish objective database benchmarks. Benchmark data submitted to that organization are probably the closest one can get to a fair “apples-to-apples” comparison of cost and performance for specific, well-defined relational processing applications. Based on TPC-C benchmark results from late 2007, a low-end server platform is about four

times more cost efficient than a high-end shared memory platform from the same vendor. Excluding storage costs, the price/performance advantage of the low-end server increases to about a factor of twelve.

What if we take into account the fact that communication between nodes in a high-end SMP machine is orders of magnitude faster than communication between nodes in a commodity network-based cluster? Since workloads today are beyond the capability of any single machine (no matter how powerful), the comparison is more accurately between a smaller cluster of high-end machines and a larger cluster of low-end machines (network communication is unavoidable in both cases). Barroso and Hölzle model these two approaches under workloads that demand more or less communication, and conclude that a cluster of low-end servers approaches the performance of the equivalent cluster of high-end servers—the small performance gap is insufficient to justify the price premium of the high-end servers. For data-intensive applications, the conclusion appears to be clear: scaling “out” is superior to scaling “up”, and therefore most existing implementations of the MapReduce programming model are designed around clusters of low-end commodity servers.

Capital costs in acquiring servers are, of course, only one component of the total cost of delivering computing capacity. Operational costs are dominated by the cost of electricity to power the servers as well as other aspects of datacenter operations that are functionally related to power: power distribution, cooling, etc. [18; 67]. As a result, energy efficiency has become a key issue in building warehouse-scale computers for large-data processing. Therefore, it is important to factor in operational costs when deploying a scale-out solution based on large numbers of commodity servers.

Datacenter efficiency is typically factored into three separate components that can be independently measured and optimized [18]. The first component measures how much of a building’s incoming power is actually delivered to computing equipment, and correspondingly, how much is lost to the building’s mechanical systems (e.g., cooling, air handling) and electrical infrastructure (e.g., power distribution inefficiencies). The second component measures how much of a server’s incoming power is lost to the power supply, cooling fans, etc. The third component captures how much of the power delivered to computing components (processor, RAM, disk, etc.) is actually used to perform useful computations.

Of the three components of datacenter efficiency, the first two are relatively straightforward to objectively quantify. Adoption of industry best-practices can help datacenter operators achieve state-of-the-art efficiency. The third component, however, is much more difficult to measure. One important issue that has been identified is the non-linearity between load and power draw. That is, a server at 10% utilization may draw slightly more than half as much power as a server at 100% utilization (which means that a lightly loaded server is much less efficient than a heavily loaded server). A survey of five thousand Google servers over a six-month period shows that servers operate most of the time at between 10% and 50% utilization [17], which is an energy-inefficient operating region. As a result, Barroso and Hölzle have advocated for research and development in energy-proportional machines, where energy consumption would be proportional to load, such that an

idle processor would (ideally) consume no power, but yet retain the ability to power up (nearly) instantaneously in response to demand.

Although we have provided a brief overview here, datacenter efficiency is a topic that is beyond the scope of this book. For more details, consult Barroso and Hölzle [18] and Hamilton [67], who provide detailed cost models for typical modern datacenters. However, even factoring in operational costs, evidence suggests that scaling out remains more attractive than scaling up.

Assume failures are common. At warehouse scale, failures are not only inevitable, but commonplace. A simple calculation suffices to demonstrate: let us suppose that a cluster is built from reliable machines with a mean-time between failures (MTBF) of 1000 days (about three years). Even with these reliable servers, a 10,000-server cluster would still experience roughly 10 failures a day. For the sake of argument, let us suppose that an MTBF of 10,000 days (about 30 years) were achievable at realistic costs (which is unlikely). Even then, a 10,000-server cluster would still experience one failure daily. This means that any large-scale service that is distributed across a large cluster (either a user-facing application or a computing platform like MapReduce) must cope with hardware failures as an intrinsic aspect of its operation [66]. That is, a server may fail at any time, without notice. For example, in large clusters disk failures are common [123] and RAM experiences more errors than one might expect [135]. Datacenters suffer from both planned outages (e.g., system maintenance and hardware upgrades) and unexpected outages (e.g., power failure, connectivity loss, etc.).

A well-designed, fault-tolerant service must cope with failures up to a point without impacting the quality of service—failures should not result in inconsistencies or indeterminism from the user perspective. As servers go down, other cluster nodes should seamlessly step in to handle the load, and overall performance should gracefully degrade as server failures pile up. Just as important, a broken server that has been repaired should be able to seamlessly rejoin the service without manual reconfiguration by the administrator. Mature implementations of the MapReduce programming model are able to robustly cope with failures through a number of mechanisms such as automatic task restarts on different cluster nodes.

Move processing to the data. In traditional high-performance computing (HPC) applications (e.g., for climate or nuclear simulations), it is commonplace for a supercomputer to have “processing nodes” and “storage nodes” linked together by a high-capacity interconnect. Many data-intensive workloads are not very processor-demanding, which means that the separation of compute and storage creates a bottleneck in the network. As an alternative to moving data around, it is more efficient to move the processing around. That is, MapReduce assumes an architecture where processors and storage (disk) are co-located. In such a setup, we can take advantage of data locality by running code on the processor directly attached to the block of data we need. The distributed file system is responsible for managing the data over which MapReduce operates.

Process data sequentially and avoid random access. Data-intensive processing by definition means that the relevant datasets are too large to fit in memory and must be held on disk. Seek times for random disk access are fundamentally limited by the mechanical nature of the devices: read heads

can only move so fast and platters can only spin so rapidly. As a result, it is desirable to avoid random data access, and instead organize computations so that data are processed sequentially. A simple scenario10 poignantly illustrates the large performance gap between sequential operations and random seeks: assume a 1 terabyte database containing 10^10 100-byte records. Given reasonable assumptions about disk latency and throughput, a back-of-the-envelope calculation will show that updating 1% of the records (by accessing and then mutating each record) will take about a month on a single machine. On the other hand, if one simply reads the entire database and rewrites all the records (mutating those that need updating), the process would finish in under a work day on a single machine. Sequential data access is, literally, orders of magnitude faster than random data access.11

10 Adapted from a post by Ted Dunning on the Hadoop mailing list.
11 For more detail, Jacobs [76] provides real-world benchmarks in his discussion of large-data problems.

The development of solid-state drives is unlikely to change this balance for at least two reasons. First, the cost differential between traditional magnetic disks and solid-state disks remains substantial: large-data will for the most part remain on mechanical drives, at least in the near future. Second, although solid-state disks have substantially faster seek times, order-of-magnitude differences in performance between sequential and random access still remain.

MapReduce is primarily designed for batch processing over large datasets. To the extent possible, all computations are organized into long streaming operations that take advantage of the aggregate bandwidth of many disks in a cluster. Many aspects of MapReduce’s design explicitly trade latency for throughput.

Hide system-level details from the application developer. According to many guides on the practice of software engineering written by experienced industry professionals, one of the key reasons why writing code is difficult is because the programmer must simultaneously keep track of many details in short-term memory—ranging from the mundane (e.g., variable names) to the sophisticated (e.g., a corner case of an algorithm that requires special treatment). This imposes a high cognitive load and requires intense concentration, which leads to a number of recommendations about a programmer’s environment (e.g., quiet office, comfortable furniture, large monitors, etc.). The challenges in writing distributed software are greatly compounded—the programmer must manage details across several threads, processes, or machines. Of course, the biggest headache in distributed programming is that code runs concurrently in unpredictable orders, accessing data in unpredictable patterns. This gives rise to race conditions, deadlocks, and other well-known problems. Programmers are taught to use low-level devices such as mutexes and to apply high-level “design patterns” such as producer–consumer queues to tackle these challenges, but the truth remains: concurrent programs are notoriously difficult to reason about and even harder to debug.

MapReduce addresses the challenges of distributed programming by providing an abstraction that isolates the developer from system-level details (e.g., locking of data structures, data starvation issues in the processing pipeline, etc.). The programming model specifies simple and well-defined interfaces between a small number of components, and therefore is easy for the programmer to reason

MapReduce maintains a separation of what computations are to be performed and how those computations are actually carried out on a cluster of machines. The first is under the control of the programmer, while the second is exclusively the responsibility of the execution framework or “runtime”. The advantage is that the execution framework only needs to be designed once and verified for correctness; thereafter, as long as the developer expresses computations in the programming model, code is guaranteed to behave as expected. The upshot is that the developer is freed from having to worry about system-level details (e.g., no more debugging race conditions and addressing lock contention) and can instead focus on algorithm or application design.

Seamless scalability. For data-intensive processing, it goes without saying that scalable algorithms are highly desirable. As an aspiration, let us sketch the behavior of an ideal algorithm. We can define scalability along at least two dimensions.¹² First, in terms of data: given twice the amount of data, the same algorithm should take at most twice as long to run, all else being equal. Second, in terms of resources: given a cluster twice the size, the same algorithm should take no more than half as long to run. Furthermore, an ideal algorithm would maintain these desirable scaling characteristics across a wide range of settings: on data ranging from gigabytes to petabytes, on clusters consisting of a few to a few thousand machines. Finally, the ideal algorithm would exhibit these desired behaviors without requiring any modifications whatsoever, not even tuning of parameters.

Other than for embarrassingly parallel problems, algorithms with the characteristics sketched above are, of course, unobtainable. One of the fundamental assertions in Fred Brooks’s classic The Mythical Man-Month [28] is that adding programmers to a project behind schedule will only make it fall further behind. This is because complex tasks cannot be chopped into smaller pieces and allocated in a linear fashion, and is often illustrated with a cute quote: “nine women cannot have a baby in one month”. Although Brooks’s observations are primarily about software engineers and the software development process, the same is also true of algorithms: increasing the degree of parallelization also increases communication costs. The algorithm designer is faced with diminishing returns, and beyond a certain point, greater efficiencies gained by parallelization are entirely offset by increased communication requirements.

Nevertheless, these fundamental limitations shouldn’t prevent us from at least striving for the unobtainable. The truth is that most current algorithms are far from the ideal. In the domain of text processing, for example, most algorithms today assume that data fit in memory on a single machine. For the most part, this is a fair assumption. But what happens when the amount of data doubles in the near future, and then doubles again shortly thereafter? Simply buying more memory is not a viable solution, as the amount of data is growing faster than the price of memory is falling. Furthermore, the price of a machine does not scale linearly with the amount of available memory beyond a certain point (once again, the scaling “up” vs. scaling “out” argument). Quite simply, algorithms that require holding intermediate data in memory on a single machine will break on sufficiently large datasets: moving from a single machine to a cluster architecture requires fundamentally different algorithms (and reimplementations).

¹² See also DeWitt and Gray [50] for slightly different definitions in terms of speedup and scaleup.
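The two scaling dimensions sketched above can be stated compactly. The notation T(n, p), for running time on n units of data with p machines, is introduced here only for illustration and does not appear in the original text. The second display is a deliberately simple toy model of diminishing returns, with a hypothetical per-machine communication cost c; it is an assumption for the sake of the example, not a claim about any real system.

```latex
% T(n, p): running time on n units of data using p machines (notation assumed here)
\begin{align*}
T(2n,\, p) &\le 2\,T(n,\, p) && \text{(twice the data: at most twice the time)} \\
T(n,\, 2p) &\le \tfrac{1}{2}\,T(n,\, p) && \text{(twice the cluster: at most half the time)}
\end{align*}

% Toy model: perfectly divisible work n/p plus a communication term that grows with p.
% c > 0 is a hypothetical per-machine communication cost.
\begin{align*}
T(n,\, p) = \frac{n}{p} + c\,p,
\qquad
\frac{\partial T}{\partial p} = -\frac{n}{p^{2}} + c = 0
\;\Longrightarrow\;
p^{*} = \sqrt{n / c}
\end{align*}
```

In this toy model, adding machines beyond p* makes the total running time longer rather than shorter, which is one way to picture the point at which communication costs offset the gains from parallelization.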

Perhaps the most exciting aspect of MapReduce is that it represents a small step toward algorithms that behave in the ideal manner discussed above. Recall that the programming model maintains a clear separation between what computations need to occur and how those computations are actually orchestrated on a cluster. As a result, a MapReduce algorithm remains fixed, and it is the responsibility of the execution framework to execute the algorithm. Amazingly, the MapReduce programming model is simple enough that it is actually possible, in many circumstances, to approach the ideal scaling characteristics discussed above. We introduce the idea of the “tradeable machine hour”, as a play on Brooks’s classic title. If running an algorithm on a particular dataset takes 100 machine hours, then we should be able to finish in an hour on a cluster of 100 machines, or use a cluster of 10 machines to complete the same task in ten hours.¹³ With MapReduce, this isn’t so far from the truth, at least for some applications.

1.3 WHY IS THIS DIFFERENT?

“Due to the rapidly decreasing cost of processing, memory, and communication, it has appeared inevitable for at least two decades that parallel machines will eventually displace sequential ones in computationally intensive domains. This, however, has not happened.”
    Leslie Valiant [148]¹⁴

For several decades, computer scientists have predicted that the dawn of the age of parallel computing was “right around the corner” and that sequential processing would soon fade into obsolescence (consider, for example, the above quote). Yet, until very recently, they have been wrong. The relentless progress of Moore’s Law for several decades has ensured that most of the world’s problems could be solved by single-processor machines, save the needs of a few (scientists simulating molecular interactions or nuclear reactions, for example). Couple that with the inherent challenges of concurrency, and the result has been that parallel processing and distributed systems have largely been confined to a small segment of the market and esoteric upper-level electives in the computer science curriculum.

However, all of that changed around the middle of the first decade of this century. The manner in which the semiconductor industry had been exploiting Moore’s Law simply ran out of opportunities for improvement: faster clocks, deeper pipelines, superscalar architectures, and other tricks of the trade reached a point of diminishing returns that did not justify continued investment. This marked the beginning of an entirely new strategy and the dawn of the multi-core era [115]. Unfortunately, this radical shift in hardware architecture was not matched at that time by corresponding advances in how software could be easily designed for these new processors (but not for lack of trying [104]). Nevertheless, parallel processing became an important issue at the forefront of everyone’s mind: it represented the only way forward.

¹³ Note that this idea meshes well with utility computing, where a 100-machine cluster running for one hour would cost the same as a 10-machine cluster running for ten hours.
¹⁴ Guess when this was written? You may be surprised.

At around the same time, we witnessed the growth of large-data problems. In the late 1990s and even during the beginning of the first decade of this century, relatively few organizations had data-intensive processing needs that required large clusters: a handful of internet companies and perhaps a few dozen large corporations. But then, everything changed. Through a combination of many different factors (falling prices of disks, rise of user-generated web content, etc.), large-data problems began popping up everywhere. Data-intensive processing needs became widespread, which drove innovations in distributed computing such as MapReduce, first by Google, and then by Yahoo and the open source community. This in turn created more demand: when organizations learned about the availability of effective data analysis tools for large datasets, they began instrumenting various business processes to gather even more data, driven by the belief that more data lead to deeper insights and greater competitive advantages. Today, not only are large-data problems ubiquitous, but technological solutions for addressing them are widely accessible. Anyone can download the open-source Hadoop implementation of MapReduce, pay a modest fee to rent a cluster from a utility cloud provider, and be happily processing terabytes upon terabytes of data within the week. Finally, the computer scientists are right: the age of parallel computing has begun, both in terms of multiple cores in a chip and multiple machines in a cluster (each of which often has multiple cores).

Why is MapReduce important? In practical terms, it provides a very effective tool for tackling large-data problems. But beyond that, MapReduce is important in how it has changed the way we organize computations at a massive scale. MapReduce represents the first widely adopted step away from the von Neumann model that has served as the foundation of computer science over the last half plus century. Valiant called this a bridging model [148], a conceptual bridge between the physical implementation of a machine and the software that is to be executed on that machine. Until recently, the von Neumann model has served us well: hardware designers focused on efficient implementations of the von Neumann model and didn’t have to think much about the actual software that would run on the machines. Similarly, the software industry developed software targeted at the model without worrying about the hardware details. The result was extraordinary growth: chip designers churned out successive generations of increasingly powerful processors, and software engineers were able to develop applications in high-level languages that exploited those processors.

Today, however, the von Neumann model isn’t sufficient anymore: we can’t treat a multi-core processor or a large cluster as an agglomeration of many von Neumann machine instances communicating over some interconnect. Such a view places too much burden on the software developer to effectively take advantage of available computational resources: it simply is the wrong level of abstraction. MapReduce can be viewed as the first breakthrough in the quest for new abstractions that allow us to organize computations, not over individual machines, but over entire clusters. As Barroso puts it, the datacenter is the computer [18, 119].

To be fair, MapReduce is certainly not the first model of parallel computation that has been proposed. The most prevalent model in theoretical computer science, which dates back several decades, is the PRAM [60, 77].¹⁵ In the model, an arbitrary number of processors, sharing an unboundedly large memory, operate synchronously on a shared input to produce some output. Other models include LogP [43] and BSP [148]. For reasons that are beyond the scope of this book, none of these previous models have enjoyed the success that MapReduce has in terms of adoption and in terms of impact on the daily lives of millions of users.¹⁶

MapReduce is the most successful abstraction over large-scale computational resources we have seen to date. However, as anyone who has taken an introductory computer science course knows, abstractions manage complexity by hiding details and presenting well-defined behaviors to users of those abstractions. They, inevitably, are imperfect, making certain tasks easier but others more difficult, and sometimes, impossible (in the case where the detail suppressed by the abstraction is exactly what the user cares about). This critique applies to MapReduce: it makes certain large-data problems easier, but suffers from limitations as well. This means that MapReduce is not the final word, but rather the first in a new class of programming models that will allow us to more effectively organize computations at a massive scale.

So if MapReduce is only the beginning, what’s next beyond MapReduce? We’re getting ahead of ourselves, as we can’t meaningfully answer this question before thoroughly understanding what MapReduce can and cannot do well. This is exactly the purpose of this book: let us now begin our exploration.

1.4 WHAT THIS BOOK IS NOT

Actually, not quite yet... A final word before we get started. This book is about MapReduce algorithm design, particularly for text processing (and related) applications. Although our presentation most closely follows the Hadoop open-source implementation of MapReduce, this book is explicitly not about Hadoop programming. We don’t, for example, discuss APIs, command-line invocations for running jobs, etc. For those aspects, we refer the reader to Tom White’s excellent book, “Hadoop: The Definitive Guide” [154].

¹⁵ More than a theoretical model, the PRAM has been recently prototyped in hardware [153].
¹⁶ Nevertheless, it is important to understand the relationship between MapReduce and existing models so that we can bring to bear accumulated knowledge about parallel algorithms; for example, Karloff et al. [82] demonstrated that a large class of PRAM algorithms can be efficiently simulated via MapReduce.


CHAPTER 2

MapReduce Basics

The only feasible approach to tackling large-data problems today is to divide and conquer, a fundamental concept in computer science that is introduced very early in typical undergraduate curricula. The basic idea is to partition a large problem into smaller sub-problems. To the extent that the sub-problems are independent [5], they can be tackled in parallel by different workers: threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster. Intermediate results from each individual worker are then combined to yield the final output.¹

The general principles behind divide-and-conquer algorithms are broadly applicable to a wide range of problems in many different application domains. However, the details of their implementations are varied and complex. For example, the following are just some of the issues that need to be addressed:

• How do we break up a large problem into smaller tasks? More specifically, how do we decompose the problem so that the smaller tasks can be executed in parallel?
• How do we assign tasks to workers distributed across a potentially large number of machines (while keeping in mind that some workers are better suited to running some tasks than others, e.g., due to available resources, locality constraints, etc.)?
• How do we ensure that the workers get the data they need?
• How do we coordinate synchronization among the different workers?
• How do we share partial results from one worker that is needed by another?
• How do we accomplish all of the above in the face of software errors and hardware faults?

In traditional parallel or distributed programming environments, the developer needs to explicitly address many (and sometimes, all) of the above issues. In shared memory programming, the developer needs to explicitly coordinate access to shared data structures through synchronization primitives such as mutexes, to explicitly handle process synchronization through devices such as barriers, and to remain ever vigilant for common problems such as deadlocks and race conditions. Language extensions, like OpenMP for shared memory parallelism,² or libraries implementing the Message Passing Interface (MPI) for cluster-level parallelism,³ provide logical abstractions that hide details of operating system synchronization and communications primitives.

¹ We note that promising technologies such as quantum or biological computing could potentially induce a paradigm shift, but they are far from being sufficiently mature to solve real-world problems.
² http://www.openmp.org/
³ http://www.mcs.anl.gov/mpi/
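To make the burden of shared memory programming concrete, here is a minimal sketch of the explicit coordination described above, written with Python's standard `threading` module. The function name `count_with_lock` and the worker counts are hypothetical, chosen only for illustration; the point is that the developer, not a framework, has to guard the shared variable with a mutex and wait for every worker to finish.

```python
import threading


def count_with_lock(num_workers=4, increments=10_000):
    """Have several threads increment one shared counter safely."""
    total = 0
    lock = threading.Lock()   # mutex guarding the shared counter

    def worker():
        nonlocal total
        for _ in range(increments):
            # Without the lock, the read-modify-write below could interleave
            # with another thread's update and lose increments (a race).
            with lock:
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()              # wait for all workers, acting as a simple barrier
    return total


result = count_with_lock()
print(result)
```

Even in this tiny example the programmer has to decide what is shared, where the critical section begins and ends, and when it is safe to read the result.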

However, even with these extensions, developers are still burdened to keep track of how resources are made available to workers. Additionally, these frameworks are mostly designed to tackle processor-intensive problems and have only rudimentary support for dealing with very large amounts of input data. When using existing parallel computing approaches for large-data computation, the programmer must devote a significant amount of attention to low-level system details, which detracts from higher-level problem solving.

One of the most significant advantages of MapReduce is that it provides an abstraction that hides many system-level details from the programmer. Therefore, a developer can focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes that depend on them. Like OpenMP and MPI, MapReduce provides a means to distribute computation without burdening the programmer with the details of distributed computing (but at a different level of granularity). However, organizing and coordinating large amounts of computation is only part of the challenge. Large-data processing by definition requires bringing data and code together for computation to occur, no small feat for datasets that are terabytes and perhaps petabytes in size! MapReduce addresses this challenge by providing a simple abstraction for the developer, transparently handling most of the details behind the scenes in a scalable, robust, and efficient manner. As we mentioned in Chapter 1, instead of moving large amounts of data around, it is far more efficient, if possible, to move the code to the data. This is operationally realized by spreading data across the local disks of nodes in a cluster and running processes on nodes that hold the data. The complex task of managing storage in such a processing environment is typically handled by a distributed file system that sits underneath MapReduce.

This chapter introduces the MapReduce programming model and the underlying distributed file system. We start in Section 2.1 with an overview of functional programming, from which MapReduce draws its inspiration. Section 2.2 introduces the basic programming model, focusing on mappers and reducers. Section 2.3 discusses the role of the execution framework in actually running MapReduce programs (called jobs). Section 2.4 fills in additional details by introducing partitioners and combiners, which provide greater control over data flow. MapReduce would not be practical without a tightly-integrated distributed file system that manages the data being processed; Section 2.5 covers this in detail. Tying everything together, a complete cluster architecture is described in Section 2.6 before the chapter ends with a summary.

2.1 FUNCTIONAL PROGRAMMING ROOTS

MapReduce has its roots in functional programming, which is exemplified in languages such as Lisp and ML.⁴ A key feature of functional languages is the concept of higher-order functions, or functions that can accept other functions as arguments.

⁴ However, there are important characteristics of MapReduce that make it non-functional in nature; this will become apparent later.

Two common built-in higher order functions are map and fold, illustrated in Figure 2.1. Given a list, map takes as an argument a function f (that takes a single argument) and applies it to all elements in a list (the top part of the diagram). Given a list, fold takes as arguments a function g (that takes two arguments) and an initial value: g is first applied to the initial value and the first item in the list, the result of which is stored in an intermediate variable. This intermediate variable and the next item in the list serve as the arguments to a second application of g, the results of which are stored in the intermediate variable. This process repeats until all items in the list have been consumed; fold then returns the final value of the intermediate variable. Typically, map and fold are used in combination. For example, to compute the sum of squares of a list of integers, one could map a function that squares its argument (i.e., λx.x²) over the input list, and then fold the resulting list with the addition function (more precisely, λxλy.x + y) using an initial value of zero.

[Figure 2.1 diagram: a row of f applications over the list elements (top), feeding a chain of g applications (bottom).]

Figure 2.1: Illustration of map and fold, two higher-order functions commonly used together in functional programming: map takes a function f and applies it to every element in a list, while fold iteratively applies a function g to aggregate results.

We can view map as a concise way to represent the transformation of a dataset (as defined by the function f). In the same vein, we can view fold as an aggregation operation, as defined by the function g. One immediate observation is that the application of f to each item in a list (or more generally, to elements in a large dataset) can be parallelized in a straightforward manner, since each functional application happens in isolation. In a cluster, these operations can be distributed across many different machines. The fold operation, on the other hand, has more restrictions on data locality: elements in the list must be “brought together” before the function g can be applied. However, many real-world applications do not require g to be applied to all elements of the list. To the extent that elements in the list can be divided into groups, the fold aggregations can also proceed in parallel.
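The sum-of-squares example can be written out directly. The sketch below uses Python's built-in `map` and a hand-written `fold` that follows the description in the text step by step; the helper name `fold` and the sample list are illustrative choices, and the final line cross-checks the result against the standard library's `functools.reduce`.

```python
from functools import reduce


def fold(g, initial, items):
    """Apply g to an intermediate variable and each item in turn."""
    intermediate = initial
    for item in items:
        intermediate = g(intermediate, item)
    return intermediate


numbers = [1, 2, 3, 4, 5]

# map: apply f = (lambda x: x * x) to every element, each in isolation.
squares = list(map(lambda x: x * x, numbers))

# fold: aggregate with g = (lambda x, y: x + y), starting from zero.
sum_of_squares = fold(lambda x, y: x + y, 0, squares)

print(squares)          # [1, 4, 9, 16, 25]
print(sum_of_squares)   # 55

# The standard library's reduce performs the same left fold.
assert sum_of_squares == reduce(lambda x, y: x + y, squares, 0)
```

Each call to the squaring function depends only on its own argument, which is why the map step parallelizes trivially, whereas every step of the fold depends on the intermediate value produced by the previous one.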

Furthermore, for operations that are commutative and associative, significant efficiencies can be gained in the fold operation through local aggregation and appropriate reordering.

In a nutshell, we have described MapReduce. The map phase in MapReduce roughly corresponds to the map operation in functional programming, whereas the reduce phase in MapReduce roughly corresponds to the fold operation in functional programming. As we will discuss in detail shortly, the MapReduce execution framework coordinates the map and reduce phases of processing over large amounts of data on large clusters of commodity machines.

Viewed from a slightly different angle, MapReduce codifies a generic “recipe” for processing large datasets that consists of two stages. In the first stage, a user-specified computation is applied over all input records in a dataset. These operations occur in parallel and yield intermediate output that is then aggregated by another user-specified computation. The programmer defines these two types of computations, and the execution framework coordinates the actual processing (very loosely, MapReduce provides a functional abstraction). Although such a two-stage processing structure may appear to be very restrictive, many interesting algorithms can be expressed quite concisely, especially if one decomposes complex algorithms into a sequence of MapReduce jobs. Subsequent chapters in this book focus on how a number of algorithms can be implemented in MapReduce.

To be precise, MapReduce can refer to three distinct but related concepts. First, MapReduce is a programming model, which is the sense discussed above. Second, MapReduce can refer to the execution framework (i.e., the “runtime”) that coordinates the execution of programs written in this particular style. Finally, MapReduce can refer to the software implementation of the programming model and the execution framework: for example, Google’s proprietary implementation vs. the open-source Hadoop implementation in Java. And, in fact, there are many implementations of MapReduce, e.g., targeted specifically for multi-core processors [127], for GPGPUs [71], for the CELL architecture [126], etc. There are some differences between the MapReduce programming model implemented in Hadoop and Google’s proprietary implementation, which we will explicitly discuss throughout the book. However, we take a rather Hadoop-centric view of MapReduce, since Hadoop remains the most mature and accessible implementation to date, and therefore the one most developers are likely to use.

2.2 MAPPERS AND REDUCERS

Key-value pairs form the basic data structure in MapReduce. Keys and values may be primitives such as integers, floating point values, strings, and raw bytes, or they may be arbitrarily complex structures (lists, tuples, associative arrays, etc.). Programmers typically need to define their own custom data types, although a number of libraries such as Protocol Buffers,⁵ Thrift,⁶ and Avro⁷ simplify the task.

Part of the design of MapReduce algorithms involves imposing the key-value structure on arbitrary datasets. For a collection of web pages, keys may be URLs and values may be the actual HTML content.

⁵ http://code.google.com/p/protobuf/
⁶ http://incubator.apache.org/thrift/
⁷ http://hadoop.apache.org/avro/
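The remark about commutative and associative operations can be illustrated with a small sketch of local aggregation. Here a list is split into groups, each group is folded on its own (as separate workers could do), and the partial results are folded again. The helper `chunked`, the group size, and the sample data are hypothetical choices made for the example.

```python
from functools import reduce
from operator import add, sub


def chunked(items, size):
    """Split a list into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


squares = [x * x for x in range(1, 11)]   # 1, 4, 9, ..., 100

# Local aggregation: fold each group independently.
partials = [reduce(add, group, 0) for group in chunked(squares, 3)]

# Global aggregation: fold the partial results, in any order.
total = reduce(add, partials, 0)
total_reordered = reduce(add, reversed(partials), 0)

print(partials)   # [14, 77, 194, 100]
print(total)      # 385

# Addition is commutative and associative, so grouping and order do not matter.
assert total == total_reordered == reduce(add, squares, 0)

# Subtraction is neither, so the same trick gives a different (wrong) answer.
wrong = reduce(sub, [reduce(sub, group, 0) for group in chunked(squares, 3)], 0)
assert wrong != reduce(sub, squares, 0)
```

The last two lines show why the property matters: regrouping is only safe when the aggregation function tolerates it.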

For a graph, keys may represent node ids and values may contain the adjacency lists of those nodes (see Chapter 5 for more details). In some algorithms, input keys are not particularly meaningful and are simply ignored during processing, while in other cases input keys are used to uniquely identify a datum (such as a record id). In Chapter 3, we discuss the role of complex keys and values in the design of various algorithms.

In MapReduce, the programmer defines a mapper and a reducer with the following signatures:

    map: (k1, v1) → [(k2, v2)]
    reduce: (k2, [v2]) → [(k3, v3)]

The convention [. . .] is used throughout this book to denote a list. The input to a MapReduce job starts as data stored on the underlying distributed file system (see Section 2.5). The mapper is applied to every input key-value pair (split across an arbitrary number of files) to generate an arbitrary number of intermediate key-value pairs. The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs.⁸ Implicit between the map and reduce phases is a distributed “group by” operation on intermediate keys. Intermediate data arrive at each reducer in order, sorted by the key. However, no ordering relationship is guaranteed for keys across different reducers. Output key-value pairs from each reducer are written persistently back onto the distributed file system (whereas intermediate key-value pairs are transient and not preserved). The output ends up in r files on the distributed file system, where r is the number of reducers. For the most part, there is no need to consolidate reducer output, since the r files often serve as input to yet another MapReduce job. Figure 2.2 illustrates this two-stage processing structure.

A simple word count algorithm in MapReduce is shown in Figure 2.3. This algorithm counts the number of occurrences of every word in a text collection, which may be the first step in, for example, building a unigram language model (i.e., probability distribution over words in a collection). Input key-value pairs take the form of (docid, doc) pairs stored on the distributed file system, where the former is a unique identifier for the document, and the latter is the text of the document itself. The mapper takes an input key-value pair, tokenizes the document, and emits an intermediate key-value pair for every word: the word itself serves as the key, and the integer one serves as the value (denoting that we’ve seen the word once). The MapReduce execution framework guarantees that all values associated with the same key are brought together in the reducer. Therefore, in our word count algorithm, we simply need to sum up all counts (ones) associated with each word. The reducer does exactly this, and emits final key-value pairs with the word as the key, and the count as the value. Final output is written to the distributed file system, one file per reducer. Words within each file will be sorted by alphabetical order, and each file will contain roughly the same number of words. The partitioner, which we discuss later in Section 2.4, controls the assignment of words to reducers. The output can be examined by the programmer or used as input to another MapReduce program.

There are some differences between the Hadoop implementation of MapReduce and Google’s implementation.⁹ In Hadoop, the reducer is presented with a key and an iterator over all values associated with the particular key.

⁸ This characterization, while conceptually accurate, is a slight simplification. See Section 2.6 for more details.
⁹ Personal communication, Jeff Dean.
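The mapper and reducer signatures, together with the implicit "group by" between them, can be mimicked on a single machine in a few lines. The sketch below is an in-memory simulation only, not how Hadoop or Google's implementation works: the driver `run_mapreduce` is a hypothetical helper, and the in-degree computation over a tiny adjacency-list graph is an illustrative job chosen to match the node-id example in the text.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-machine simulation of the map, group-by, and reduce stages."""
    # Map stage: every (k1, v1) pair yields zero or more (k2, v2) pairs.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)          # "group by" on the intermediate key

    # Reduce stage: keys are presented in sorted order, each with all its values.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


def in_degree_mapper(node_id, adjacency_list):
    # map: (node id, adjacency list) -> [(neighbor id, 1)]
    for neighbor in adjacency_list:
        yield neighbor, 1


def in_degree_reducer(node_id, counts):
    # reduce: (node id, [1, 1, ...]) -> [(node id, in-degree)]
    yield node_id, sum(counts)


# Hypothetical three-node graph: keys are node ids, values are adjacency lists.
graph = [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]
in_degrees = run_mapreduce(graph, in_degree_mapper, in_degree_reducer)
print(in_degrees)   # [('a', 1), ('b', 1), ('c', 2)]
```

Note that the intermediate key type (a neighbor's node id) and value type (an integer) differ from the input value type (a list), as the signatures allow.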

[Figure 2.2 diagram: input pairs labeled A through F feed four mappers, which emit the intermediate pairs (a, 1), (b, 2), (c, 3), (c, 6), (a, 5), (c, 2), (b, 7), (c, 8). A stage labeled “Shuffle and Sort: aggregate values by keys” groups them by key for three reducers, which emit (X, 5), (Y, 7), and (Z, 9).]

Figure 2.2: Simplified view of MapReduce. Mappers are applied to all input key-value pairs, which generate an arbitrary number of intermediate key-value pairs. Reducers are applied to all values associated with the same key. Between the map and reduce phases lies a barrier that involves a large distributed sort and group by.

    class Mapper
        method Map(docid a, doc d)
            for all term t ∈ doc d do
                Emit(term t, count 1)

    class Reducer
        method Reduce(term t, counts [c1, c2, . . .])
            sum ← 0
            for all count c ∈ counts [c1, c2, . . .] do
                sum ← sum + c
            Emit(term t, count sum)

Figure 2.3: Pseudo-code for the word count algorithm in MapReduce. The mapper emits an intermediate key-value pair for each word in a document. The reducer sums up all counts for each word.
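The word count pseudo-code in Figure 2.3 translates almost line for line into runnable Python. The sketch below keeps the Mapper and Reducer classes of the figure; the whitespace tokenizer, the in-memory driver `word_count`, and the two sample documents are assumptions added so that the example can run on one machine, and none of them reflects Hadoop's actual API.

```python
from collections import defaultdict


class Mapper:
    def map(self, docid, doc):
        # Emit (term, 1) for every term; whitespace tokenization is assumed.
        for term in doc.split():
            yield term, 1


class Reducer:
    def reduce(self, term, counts):
        # Sum all the counts that were grouped under this term.
        total = 0
        for c in counts:
            total += c
        yield term, total


def word_count(documents):
    """Hypothetical in-memory driver standing in for the execution framework."""
    mapper, reducer = Mapper(), Reducer()

    grouped = defaultdict(list)            # the "shuffle and sort" stage
    for docid, doc in documents:
        for term, count in mapper.map(docid, doc):
            grouped[term].append(count)

    result = {}
    for term in sorted(grouped):           # reducers see keys in sorted order
        for key, total in reducer.reduce(term, grouped[term]):
            result[key] = total
    return result


docs = [("d1", "a rose is a rose"), ("d2", "is a rose red")]
counts = word_count(docs)
print(counts)   # {'a': 3, 'is': 2, 'red': 1, 'rose': 3}
```

As in the figure, the input key (the document id) is ignored by the mapper, and the reducer never needs to know which document a count came from.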

The values are arbitrarily ordered. Google’s implementation allows the programmer to specify a secondary sort key for ordering the values (if desired), in which case values associated with each key would be presented to the developer’s reduce code in sorted order. Later in Section 3.4 we discuss how to overcome this limitation in Hadoop to perform secondary sorting. Another difference: in Google’s implementation the programmer is not allowed to change the key in the reducer. That is, the reducer output key must be exactly the same as the reducer input key. In Hadoop, there is no such restriction, and the reducer can emit an arbitrary number of output key-value pairs (with different keys).

To provide a bit more implementation detail: pseudo-code provided in this book roughly mirrors how MapReduce programs are written in Hadoop. Mappers and reducers are objects that implement the Map and Reduce methods, respectively. In Hadoop, a mapper object is initialized for each map task (associated with a particular sequence of key-value pairs called an input split) and the Map method is called on each key-value pair by the execution framework. In configuring a MapReduce job, the programmer provides a hint on the number of map tasks to run, but the execution framework (see next section) makes the final determination based on the physical layout of the data (more details in Section 2.5 and Section 2.6). The situation is similar for the reduce phase: a reducer object is initialized for each reduce task, and the Reduce method is called once per intermediate key. In contrast with the number of map tasks, the programmer can precisely specify the number of reduce tasks. We will return to discuss the details of Hadoop job execution in Section 2.6, which is dependent on an understanding of the distributed file system (covered in Section 2.5). To reiterate: although the presentation of algorithms in this book closely mirrors the way they would be implemented in Hadoop, our focus is on algorithm design and conceptual understanding, not actual Hadoop programming. For that, we recommend Tom White’s book [154].

What are the restrictions on mappers and reducers? Mappers and reducers can express arbitrary computations over their inputs. However, one must generally be careful about use of external resources since multiple mappers or reducers may be contending for those resources. For example, it may be unwise for a mapper to query an external SQL database, since that would introduce a scalability bottleneck on the number of map tasks that could be run in parallel (since they might all be simultaneously querying the database).¹⁰ In general, mappers can emit an arbitrary number of intermediate key-value pairs, and they need not be of the same type as the input key-value pairs. Similarly, reducers can emit an arbitrary number of final key-value pairs, and they can differ in type from the intermediate key-value pairs. Although not permitted in functional programming, mappers and reducers can have side effects. This is a powerful and useful feature: for example, preserving state across multiple inputs is central to the design of many MapReduce algorithms (see Chapter 3). Such algorithms can be understood as having side effects that only change state that is internal to the mapper or reducer. While the correctness of such algorithms may be more difficult to guarantee (since the function’s behavior depends not only on the current input but on previous inputs), most potential synchronization problems are avoided since internal state is private only to individual mappers and reducers.

¹⁰ Unless, of course, the database itself is highly scalable.
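The idea of a mapper object that preserves state across multiple inputs can be sketched as follows. This is an illustration under stated assumptions rather than a pattern taken from this section: the `close` method is a hypothetical end-of-task hook, `run_map_task` is a stand-in for the execution framework, and the per-split term counting is simply one convenient use of internal state.

```python
from collections import Counter


class StatefulMapper:
    """One object per map task; its state persists across map() calls."""

    def __init__(self):
        # Internal state, private to this mapper object.
        self.partial_counts = Counter()

    def map(self, docid, doc):
        # Side effect: update internal state instead of emitting right away.
        self.partial_counts.update(doc.split())
        return []

    def close(self):
        # Hypothetical hook, assumed to run once after the last input pair.
        return sorted(self.partial_counts.items())


def run_map_task(input_split):
    """Stand-in for the framework: one mapper object per input split."""
    mapper = StatefulMapper()
    emitted = []
    for docid, doc in input_split:
        emitted.extend(mapper.map(docid, doc))
    emitted.extend(mapper.close())
    return emitted


split = [("d1", "b a b"), ("d2", "a c")]
pairs = run_map_task(split)
print(pairs)   # [('a', 2), ('b', 2), ('c', 1)]
```

The output of each `map` call now depends on earlier inputs, which is exactly the kind of side effect the text describes; because the counter belongs to a single mapper object, no locking is needed.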

In addition to the “canonical” MapReduce processing ﬂow. MAPREDUCE BASICS pers and reducers. it may be useful for mappers or reducers to have external side effects. computing π ) or may only consume a small amount of data (e. HBase is an open-source Bigtable clone and has similar capabilities.. input to a MapReduce job comes from data stored on the distributed ﬁle system and output is written back to the distributed ﬁle system. distributed. a map task may be responsible for processing a certain block of input key-value pairs (called an input split in Hadoop). special care must be taken to ensure that such operations avoid synchronization conﬂicts. Each MapReduce job is divided into smaller units called tasks (see Section 2. Bigtable [34]. such as writing ﬁles to the distributed ﬁle system. 2.3 THE EXECUTION FRAMEWORK One of the most important ideas behind MapReduce is separating the what of distributed processing from the how. running identity mappers and reducers has the effect of regrouping and resorting the input data (which is sometimes useful). referred to as a job.. on clusters ranging from a single node to a few thousand nodes. consists of code for mappers and reducers (as well as combiners and partitioners to be discussed in the next section) packaged together with conﬁguration parameters (such as where the input lies and where the output should be stored). any other system that satisﬁes the proper abstractions can serve as a data source or sink. this would be a common pattern. This has the effect of sorting and regrouping the input for reduce-side processing. The developer submits the job to the submission node of a cluster (in Hadoop. a reduce task may handle a portion of the . With Google’s MapReduce implementation. Hadoop has been integrated with existing MPP (massively parallel processing) relational databases. For example. MapReduce programs can contain no reducers. Although in the most common case.g. 
in some cases it is useful for the reducer to implement the identity function. parse a large text collection or independently analyze a large number of images. e. In other cases (see Sections 4. which allows a programmer to write MapReduce jobs over database rows and dump output into a new database table. input parameters to many instances of processor-intensive simulations running in parallel).5). Since many mappers and reducers are run in parallel.4 and 6. in some cases MapReduce jobs may not consume any input at all (e. this is called the jobtracker) and execution framework (sometimes called the “runtime”) takes care of everything else: it transparently handles all other aspects of distributed code execution.g. Finally.g. in which case the program simply sorts and groups mapper output. although in some cases it is useful for the mapper to implement the identity function and simply pass input key-value pairs to the reducers. For embarrassingly parallel problems. in which case mapper output is directly written to disk (one ﬁle per mapper). is frequently used as a source of input and as a store of MapReduce output. Also. and the distributed ﬁle system is a shared global resource.6 for more details). A MapReduce program. similarly. a sparse. Speciﬁc responsibilities include: Scheduling.24 2. persistent multidimensional sorted map. One strategy is to write a temporary ﬁle that is renamed upon successful completion of the mapper or reducer [45]. other variations are also possible. Similarly. The converse—a MapReduce program with no mappers—is not possible. Finally..

intermediate key space. It is not uncommon for MapReduce jobs to have thousands of individual tasks that need to be assigned to nodes in the cluster. In large jobs, the total number of tasks may exceed the number of tasks that can be run on the cluster concurrently, making it necessary for the scheduler to maintain some sort of a task queue and to track the progress of running tasks so that waiting tasks can be assigned to nodes as they become available. Another aspect of scheduling involves coordination among tasks belonging to different jobs (e.g., from different users). How can a large, shared resource support several users simultaneously in a predictable, transparent, policy-driven fashion? There has been some recent work along these lines in the context of Hadoop [131, 160].

Speculative execution is an optimization that is implemented by both Hadoop and Google’s MapReduce implementation (called “backup tasks” [45]). Due to the barrier between the map and reduce tasks, the map phase of a job is only as fast as the slowest map task. Similarly, the completion time of a job is bounded by the running time of the slowest reduce task. As a result, the speed of a MapReduce job is sensitive to what are known as stragglers, or tasks that take an unusually long time to complete. One cause of stragglers is flaky hardware: for example, a machine that is suffering from recoverable errors may become significantly slower. With speculative execution, an identical copy of the same task is executed on a different machine, and the framework simply uses the result of the first task attempt to finish. Zaharia et al. [161] presented different execution strategies in a recent paper, and Google reported that speculative execution can improve job running times by 44% [45]. Although in Hadoop both map and reduce tasks can be speculatively executed, the common wisdom is that the technique is more helpful for map tasks than reduce tasks, since each copy of the reduce task needs to pull data over the network. Note, however, that speculative execution cannot adequately address another common cause of stragglers: skew in the distribution of values associated with intermediate keys (leading to reduce stragglers). In text processing we often observe Zipfian distributions, which means that the task or tasks responsible for processing the most frequent few elements will run much longer than the typical task. Better local aggregation, discussed in the next chapter, is one possible solution to this problem.

Data/code co-location. The phrase data distribution is misleading, since one of the key ideas behind MapReduce is to move the code, not the data. However, the more general point remains: in order for computation to occur, we need to somehow feed data to the code. In MapReduce, this issue is inextricably intertwined with scheduling and relies heavily on the design of the underlying distributed file system.11 To achieve data locality, the scheduler starts tasks on the node that holds a particular block of data (i.e., on its local drive) needed by the task. This has the effect of moving code to the data. If this is not possible (e.g., a node is already running too many tasks), new tasks will be started elsewhere, and the necessary data will be streamed over the network. An important optimization here is to prefer nodes that are on the same rack in the datacenter as the node holding the relevant data block, since inter-rack bandwidth is significantly less than intra-rack bandwidth.

11 In the canonical case, that is. Recall that MapReduce may receive its input from other sources.
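The placement preference described under data/code co-location (same node first, then same rack, then anywhere) can be illustrated with a short sketch. This is not how Hadoop's scheduler is implemented; the `Node` type, the `pick_node` function, and the slot bookkeeping are hypothetical names introduced only for illustration, and a real scheduler would also weigh fairness, queues, and speculative attempts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a cluster node (illustration only)."""
    name: str
    rack: str


def pick_node(replica_holders, free_slots, nodes):
    """Choose where to run a task whose input block lives on `replica_holders`.

    replica_holders: names of nodes storing a replica of the block.
    free_slots:      dict mapping node name -> number of empty task slots.
    nodes:           dict mapping node name -> Node.
    Returns (node_name, locality); node_name is None if the task must wait.
    """
    # 1. Best case: a node that already holds the block (move code to data).
    for name in replica_holders:
        if free_slots.get(name, 0) > 0:
            return name, "node-local"

    # 2. Next best: any free node on a rack that holds a replica, because
    #    intra-rack bandwidth is greater than inter-rack bandwidth.
    replica_racks = {nodes[name].rack for name in replica_holders}
    for name, slots in free_slots.items():
        if slots > 0 and nodes[name].rack in replica_racks:
            return name, "rack-local"

    # 3. Otherwise run anywhere and stream the data over the network.
    for name, slots in free_slots.items():
        if slots > 0:
            return name, "off-rack"

    # 4. No capacity: the task stays in the scheduler's queue.
    return None, "queued"


if __name__ == "__main__":
    cluster = {n.name: n for n in (Node("n1", "rackA"),
                                   Node("n2", "rackA"),
                                   Node("n3", "rackB"))}
    # The block is stored on n1, but n1 is busy, so a rack-mate is preferred.
    print(pick_node(["n1"], {"n1": 0, "n2": 1, "n3": 1}, cluster))
```

The ordering of the three loops is the whole design: each fallback gives up a little locality in exchange for not leaving the task idle.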

Synchronization. In general, synchronization refers to the mechanisms by which multiple concurrently running processes “join up”, for example, to share intermediate results or otherwise exchange state information. In MapReduce, synchronization is accomplished by a barrier between the map and reduce phases of processing. Intermediate key-value pairs must be grouped by key, which is accomplished by a large distributed sort involving all the nodes that executed map tasks and all the nodes that will execute reduce tasks. This necessarily involves copying intermediate data over the network, and therefore the process is commonly known as “shuffle and sort”. A MapReduce job with m mappers and r reducers involves up to m × r distinct copy operations, since each mapper may have intermediate output going to every reducer.

Note that the reduce computation cannot start until all the mappers have finished emitting key-value pairs and all intermediate key-value pairs have been shuffled and sorted, since the execution framework cannot otherwise guarantee that all values associated with the same key have been gathered. This is an important departure from functional programming: in a fold operation, the aggregation function g is a function of the intermediate value and the next item in the list, which means that values can be lazily generated and aggregation can begin as soon as values are available. In contrast, the reducer in MapReduce receives all values associated with the same key at once. However, it is possible to start copying intermediate key-value pairs over the network to the nodes running the reducers as soon as each mapper finishes; this is a common optimization and is implemented in Hadoop.

Error and fault handling. The MapReduce execution framework must accomplish all the tasks above in an environment where errors and faults are the norm, not the exception. Since MapReduce was explicitly designed around low-end commodity servers, the runtime must be especially resilient. In large clusters, disk failures are common [123] and RAM experiences more errors than one might expect [135]. Datacenters suffer from both planned outages (e.g., system maintenance and hardware upgrades) and unexpected outages (e.g., power failure, connectivity loss, etc.).

And that’s just hardware. No software is bug free: exceptions must be appropriately trapped, logged, and recovered from. Large-data problems have a penchant for uncovering obscure corner cases in code that is otherwise thought to be bug-free. Furthermore, any sufficiently large dataset will contain corrupted data or records that are mangled beyond a programmer’s imagination, resulting in errors that one would never think to check for or trap. The MapReduce execution framework must thrive in this hostile environment.

2.4 PARTITIONERS AND COMBINERS

We have thus far presented a simplified view of MapReduce. There are two additional elements that complete the programming model: partitioners and combiners.

Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers. In other words, the partitioner specifies the task to which an intermediate key-value pair must be copied. Within each reducer, keys are processed in sorted order
(which is how the “group by” is implemented). The simplest partitioner involves computing the hash value of the key and then taking the mod of that value with the number of reducers. This assigns approximately the same number of keys to each reducer (dependent on the quality of the hash function). Note, however, that the partitioner only considers the key and ignores the value; therefore, a roughly even partitioning of the key space may nevertheless yield large differences in the number of key-value pairs sent to each reducer (since different keys may have different numbers of associated values). This imbalance in the amount of data associated with each key is relatively common in many text processing applications due to the Zipfian distribution of word occurrences.

Combiners are an optimization in MapReduce that allow for local aggregation before the shuffle and sort phase. We can motivate the need for combiners by considering the word count algorithm in Figure 2.3, which emits a key-value pair for each word in the collection. Furthermore, all these key-value pairs need to be copied across the network, and so the amount of intermediate data will be larger than the input collection itself. This is clearly inefficient. One solution is to perform local aggregation on the output of each mapper, i.e., to compute a local count for a word over all the documents processed by the mapper. With this modification (assuming the maximum amount of local aggregation possible), the number of intermediate key-value pairs will be at most the number of unique words in the collection times the number of mappers (and typically far smaller because each mapper may not encounter every word).

The combiner in MapReduce supports such an optimization. One can think of combiners as “mini-reducers” that take place on the output of the mappers, prior to the shuffle and sort phase. Each combiner operates in isolation and therefore does not have access to intermediate output from other mappers. The combiner is provided keys and values associated with each key (the same types as the mapper output keys and values). Critically, one cannot assume that a combiner will have the opportunity to process all values associated with the same key. The combiner can emit any number of key-value pairs, but the keys and values must be of the same type as the mapper output (same as the reducer input).12 In cases where an operation is both associative and commutative (e.g., addition or multiplication), reducers can directly serve as combiners. In general, however, reducers and combiners are not interchangeable.

12 A note on the implementation of combiners in Hadoop: by default, the execution framework reserves the right to use combiners at its discretion. In reality, this means that a combiner may be invoked zero, one, or multiple times. In addition, combiners in Hadoop may actually be invoked in the reduce phase, i.e., after key-value pairs have been copied over to the reducer, but before the user reducer code runs. As a result, combiners must be carefully written so that they can be executed in these different environments. Section 3.1.2 discusses this in more detail.

In many cases, proper use of combiners can spell the difference between an impractical algorithm and an efficient algorithm. This topic will be discussed in Section 3.1, which focuses on various techniques for local aggregation. It suffices to say for now that a combiner can significantly reduce the amount of data that needs to be copied over the network, resulting in much faster algorithms.

The complete MapReduce model is shown in Figure 2.4. Output of the mappers is processed by the combiners, which perform local aggregation to cut down on the number of intermediate key-value pairs. The partitioner determines which reducer will be responsible for processing a particular
key, and the execution framework uses this information to copy the data to the right location during the shuffle and sort phase.13 Therefore, a complete MapReduce job consists of code for the mapper, reducer, combiner, and partitioner, along with job configuration parameters. The execution framework handles everything else.

13 In Hadoop, partitioners are actually executed before combiners, so while Figure 2.4 is conceptually accurate, it doesn’t precisely describe the Hadoop implementation.

[Figure 2.4 (diagram not reproduced): four mappers emit (a, 1), (b, 2); (c, 3), (c, 6); (a, 5), (c, 2); (b, 7), (c, 8). The combiners turn these into (a, 1), (b, 2); (c, 9); (a, 5), (c, 2); (b, 7), (c, 8). Each output passes through a partitioner, and shuffle and sort aggregates values by key into a → [1, 5], b → [2, 7], c → [2, 9, 8] for three reducers.]

Figure 2.4: Complete view of MapReduce, illustrating combiners and partitioners in addition to mappers and reducers. Combiners can be viewed as “mini-reducers” in the map phase. Partitioners determine which reducer is responsible for a particular key.

2.5 THE DISTRIBUTED FILE SYSTEM

So far, we have mostly focused on the processing aspect of data-intensive processing, but it is important to recognize that without data, there is nothing to compute on. In high-performance computing (HPC) and many traditional cluster architectures, storage is viewed as a distinct and separate component from computation. Implementations vary widely, but network-attached storage (NAS) and storage area networks (SAN) are common; supercomputers often have dedicated subsystems for handling storage (separate nodes, and often even separate networks). Regardless of the details, the
processing cycle remains the same at a high level: the compute nodes fetch input from storage, load the data into memory, process the data, and then write back the results (with perhaps intermediate checkpointing for long-running processes).

As dataset sizes increase, more compute capacity is required for processing. But as compute capacity grows, the link between the compute nodes and the storage becomes a bottleneck. At that point, one could invest in higher performance but more expensive networks (e.g., 10 gigabit Ethernet) or special-purpose interconnects such as InfiniBand (even more expensive). In most cases, this is not a cost-effective solution, as the price of networking equipment increases non-linearly with performance (e.g., a switch with ten times the capacity is usually more than ten times more expensive). Alternatively, one could abandon the separation of computation and storage as distinct components in a cluster. The distributed file system (DFS) that underlies MapReduce adopts exactly this approach. The Google File System (GFS) [57] supports Google’s proprietary implementation of MapReduce; in the open-source world, HDFS (Hadoop Distributed File System) is an open-source implementation of GFS that supports Hadoop. Although MapReduce doesn’t necessarily require the distributed file system, it is difficult to realize many of the advantages of the programming model without a storage substrate that behaves much like the DFS.14

14 However, there is evidence that existing POSIX-based distributed cluster file systems (e.g., GPFS or PVFS) can serve as a replacement for HDFS, when properly tuned or modified for MapReduce workloads [6, 146]. This, however, remains an experimental use case.

Of course, distributed file systems are not new [7, 32, 74, 133, 147]. The MapReduce distributed file system builds on previous work but is specifically adapted to large-data processing workloads, and therefore departs from previous architectures in certain respects (see the discussion by Ghemawat et al. [57] in the original GFS paper). The main idea is to divide user data into blocks and replicate those blocks across the local disks of nodes in the cluster. Blocking data, of course, is not a new idea, but DFS blocks are significantly larger than block sizes in typical single-machine file systems (64 MB by default). The distributed file system adopts a master–slave architecture in which the master maintains the file namespace (metadata, directory structure, file to block mapping, location of blocks, and access permissions) and the slaves manage the actual data blocks. In GFS, the master is called the GFS master, and the slaves are called GFS chunkservers. In Hadoop, the same roles are filled by the namenode and datanodes, respectively.15 This book adopts the Hadoop terminology, although for most basic file operations GFS and HDFS work much the same way. The architecture of HDFS is shown in Figure 2.5, redrawn from a similar diagram describing GFS [57].

15 To be precise, namenode and datanode may refer to physical machines in a cluster, or they may refer to daemons running on those machines providing the relevant services.

In HDFS, an application client wishing to read a file (or a portion thereof) must first contact the namenode to determine where the actual data is stored. In response to the client request, the namenode returns the relevant block id and the location where the block is held (i.e., which datanode). The client then contacts the datanode to retrieve the data. Blocks are themselves stored on standard single-machine file systems, so HDFS lies on top of the standard OS stack (e.g., Linux). An important feature of the design is that data is never moved through the namenode. Instead, all data transfer
occurs directly between clients and datanodes; communication with the namenode only involves transfer of metadata.

[Figure 2.5 (diagram not reproduced): an application’s HDFS client sends (file name, block id) to the HDFS namenode, which holds the file namespace (e.g., /foo/bar → block 3df2), and receives (block id, block location). The client then sends (block id, byte range) to an HDFS datanode, which sits on a Linux file system, and receives block data. The namenode sends instructions to the datanodes and receives datanode state.]

Figure 2.5: The architecture of HDFS. The namenode (master) is responsible for maintaining the file namespace and directing clients to datanodes (slaves) that actually hold data blocks containing user data.

By default, HDFS stores three separate copies of each data block to ensure reliability, availability, and performance. In large clusters, the three replicas are spread across different physical racks, so HDFS is resilient towards two common failure scenarios: individual datanode crashes and failures in networking equipment that bring an entire rack offline. Replicating blocks across physical machines also increases opportunities to co-locate data and processing in the scheduling of MapReduce jobs, since multiple copies yield more opportunities to exploit locality. The namenode is in periodic communication with the datanodes to ensure proper replication of all the blocks: if there aren’t enough replicas (e.g., due to disk or machine failures or to connectivity losses due to networking equipment failures), the namenode directs the creation of additional copies;16 if there are too many replicas (e.g., a repaired node rejoins the cluster), extra copies are discarded.

16 Note that the namenode coordinates the replication process, but data transfer occurs directly from datanode to datanode.

To create a new file and write data to HDFS, the application client first contacts the namenode, which updates the file namespace after checking permissions and making sure the file doesn’t already exist. The namenode allocates a new block on a suitable datanode, and the application is directed to stream data directly to it. From the initial datanode, data is further propagated to additional replicas. In the most recent release of Hadoop as of this writing (release 0.20.2), files are immutable: they cannot be modified after creation. There are current plans to officially support file appends in the near future, which is a feature already present in GFS.

In summary, the HDFS namenode has the following responsibilities:

• Namespace management. The namenode is responsible for maintaining the file namespace, which includes metadata, directory structure, file to block mapping, location of blocks, and access permissions. These data are held in memory for fast access and all mutations are persistently logged.

• Coordinating file operations. The namenode directs application clients to datanodes for read operations, and allocates blocks on suitable datanodes for write operations. All data transfers occur directly between clients and datanodes. When a file is deleted, HDFS does not immediately reclaim the available physical storage; rather, blocks are lazily garbage collected.

• Maintaining overall health of the file system. The namenode is in periodic contact with the datanodes via heartbeat messages to ensure the integrity of the system. If the namenode observes that a data block is under-replicated (fewer copies are stored on datanodes than the desired replication factor), it will direct the creation of new replicas. Finally, the namenode is also responsible for rebalancing the file system.17 During the course of normal operations, certain datanodes may end up holding more blocks than others; rebalancing involves moving blocks from datanodes with more blocks to datanodes with fewer blocks. This leads to better load balancing and more even disk utilization.

17 In Hadoop, this is a manually invoked process.

Since GFS and HDFS were specifically designed to support Google’s proprietary and the open-source implementation of MapReduce, respectively, they were designed with a number of assumptions about the operational environment, which in turn influenced the design of the systems. Understanding these choices is critical to designing effective MapReduce algorithms:

• The file system stores a relatively modest number of large files. The definition of “modest” varies by the size of the deployment, but in HDFS multi-gigabyte files are common (and even encouraged). There are several reasons why lots of small files are to be avoided. Since the namenode must hold all file metadata in memory, this presents an upper bound on both the number of files and blocks that can be supported.18 Large multi-block files represent a more efficient use of namenode memory than many single-block files (each of which consumes less space than a single block size). In addition, mappers in a MapReduce job use individual files as a basic unit for splitting input data. At present, there is no default mechanism in Hadoop that allows a mapper to process multiple files. As a result, mapping over many small files will yield as many map tasks as there are files. This results in two potential problems: first, the startup costs of mappers may become significant compared to the time spent actually processing input key-value pairs; second, this may result in an excessive number of across-the-network copy operations during the “shuffle and sort” phase (recall that a MapReduce job with m mappers and r reducers involves up to m × r distinct copy operations).

18 According to Dhruba Borthakur in a post to the Hadoop mailing list on 6/8/2008, each block in HDFS occupies about 150 bytes of memory on the namenode.

• Workloads are batch oriented, dominated by long streaming reads and large sequential writes. As a result, high sustained bandwidth is more important than low latency. This exactly describes the nature of MapReduce jobs, which are batch operations on large amounts of data. Due to the common-case workload, both HDFS and GFS do not implement any form of data caching.19

• Applications are aware of the characteristics of the distributed file system. Neither HDFS nor GFS presents a general POSIX-compliant API, but rather support only a subset of possible file operations. This simplifies the design of the distributed file system, and in essence pushes part of the data management onto the end application. One rationale for this decision is that each application knows best how to handle data specific to that application, for example, in terms of resolving inconsistent states and optimizing the layout of data structures.

• The file system is deployed in an environment of cooperative users. There is no discussion of security in the original GFS paper, but HDFS explicitly assumes a datacenter environment where only authorized users have access. File permissions in HDFS are only meant to prevent unintended operations and can be easily circumvented.20

• The system is built from unreliable but inexpensive commodity components. As a result, failures are the norm rather than the exception. HDFS is designed around a number of self-monitoring and self-healing mechanisms to robustly cope with common failure modes.

19 However, since the distributed file system is built on top of a standard operating system such as Linux, there is still OS-level caching.
20 However, there are existing plans to integrate Kerberos into Hadoop/HDFS.

Finally, some discussion is necessary to understand the single-master design of HDFS and GFS. It has been demonstrated that in large-scale distributed systems, simultaneously providing consistency, availability, and partition tolerance is impossible; this is Brewer’s so-called CAP Theorem [58]. Since partitioning is unavoidable in large-data systems, the real tradeoff is between consistency and availability. A single-master design trades availability for consistency and significantly simplifies implementation. If the master (HDFS namenode or GFS master) goes down, the entire file system becomes unavailable, which trivially guarantees that the file system will never be in an inconsistent state. An alternative design might involve multiple masters that jointly manage the file namespace; such an architecture would increase availability (if one goes down, another can step in) at the cost of consistency, not to mention requiring a more complex implementation (cf. [4, 105]).

The single-master design of GFS and HDFS is a well-known weakness, since if the master goes offline, the entire file system and all MapReduce jobs running on top of it will grind to a halt. This weakness is mitigated in part by the lightweight nature of file system operations. Recall that no data is ever moved through the namenode and that all communication between clients and the namenode involves only metadata. Because of this, the namenode rarely is the bottleneck, and for the most part avoids load-induced crashes. In practice, this single point of failure is not as severe a limitation as it may appear: with diligent monitoring of the namenode, mean times between failure measured in months are not uncommon for production deployments. Furthermore, the Hadoop community is
well aware of this problem and has developed several reasonable workarounds, for example, a warm standby namenode that can be quickly switched over when the primary namenode fails. The open source environment and the fact that many organizations already depend on Hadoop for production systems virtually guarantee that more effective solutions will be developed over time.

2.6 HADOOP CLUSTER ARCHITECTURE

Putting everything together, the architecture of a complete Hadoop cluster is shown in Figure 2.6. The HDFS namenode runs the namenode daemon. The job submission node runs the jobtracker, which is the single point of contact for a client wishing to execute a MapReduce job. The jobtracker monitors the progress of running MapReduce jobs and is responsible for coordinating the execution of the mappers and reducers. Typically, these services run on two separate machines, although in smaller clusters they are often co-located. The bulk of a Hadoop cluster consists of slave nodes (only three of which are shown in the figure) that run both a tasktracker, which is responsible for actually running user code, and a datanode daemon, for serving HDFS data.

[Figure 2.6 (diagram not reproduced): a namenode running the namenode daemon, a job submission node running the jobtracker, and slave nodes that each run a tasktracker and a datanode daemon on top of a Linux file system.]

Figure 2.6: Architecture of a complete Hadoop cluster, which consists of three separate components: the HDFS master (called the namenode), the job submission node (called the jobtracker), and many slave nodes (three shown here). Each of the slave nodes runs a tasktracker for executing map and reduce tasks and a datanode daemon for serving HDFS data.

A Hadoop MapReduce job is divided up into a number of map tasks and reduce tasks. Tasktrackers periodically send heartbeat messages to the jobtracker that also double as a vehicle for task allocation. If a tasktracker is available to run tasks (in Hadoop parlance, has empty task slots), the return acknowledgment of the tasktracker heartbeat contains task allocation information. The number of reduce tasks is equal to the number of reducers specified by the programmer. The number of map tasks, on the other hand, depends on many factors: the number of mappers specified by the
programmer serves as a hint to the execution framework, but the actual number of tasks depends on both the number of input files and the number of HDFS data blocks occupied by those files. Each map task is assigned a sequence of input key-value pairs, called an input split in Hadoop. Input splits are computed automatically and the execution framework strives to align them to HDFS block boundaries so that each map task is associated with a single data block. In scheduling map tasks, the jobtracker tries to take advantage of data locality: if possible, map tasks are scheduled on the slave node that holds the input split, so that the mapper will be processing local data. The alignment of input splits with HDFS block boundaries simplifies task scheduling. If it is not possible to run a map task on local data, it becomes necessary to stream input key-value pairs across the network. Since large clusters are organized into racks, with far greater intra-rack bandwidth than inter-rack bandwidth, the execution framework strives to at least place map tasks on a rack which has a copy of the data block.

Although conceptually in MapReduce one can think of the mapper being applied to all input key-value pairs and the reducer being applied to all values associated with the same key, actual job execution is a bit more complex. In Hadoop, mappers are Java objects with a Map method (among others). A mapper object is instantiated for every map task by the tasktracker. The life-cycle of this object begins with instantiation, where a hook is provided in the API to run programmer-specified code. This means that mappers can read in “side data”, providing an opportunity to load state, static data sources, dictionaries, etc. After initialization, the Map method is called (by the execution framework) on all key-value pairs in the input split. Since these method calls occur in the context of the same Java object, it is possible to preserve state across multiple input key-value pairs within the same map task; this is an important property to exploit in the design of MapReduce algorithms, as we will see in the next chapter. After all key-value pairs in the input split have been processed, the mapper object provides an opportunity to run programmer-specified termination code. This, too, will be important in the design of MapReduce algorithms.

The actual execution of reducers is similar to that of the mappers. A reducer object is instantiated for every reduce task. The Hadoop API provides hooks for programmer-specified initialization and termination code. After initialization, for each intermediate key in the partition (defined by the partitioner), the execution framework repeatedly calls the Reduce method with an intermediate key and an iterator over all values associated with that key. The programming model also guarantees that intermediate keys will be presented to the Reduce method in sorted order. Since this occurs in the context of a single object, it is possible to preserve state across multiple intermediate keys (and associated values) within a single reduce task. Once again, this property is critical in the design of MapReduce algorithms and will be discussed in the next chapter.

2.7 SUMMARY

This chapter provides a basic overview of the MapReduce programming model, starting with its roots in functional programming and continuing with a description of mappers, reducers, partitioners, and combiners. Significant attention is also given to the underlying distributed file system, which is a
SUMMARY 35 tightly-integrated component of the MapReduce environment.2. we now turn our attention to the design of MapReduce algorithms.7. Given this basic understanding. .


CHAPTER 3

MapReduce Algorithm Design

A large part of the power of MapReduce comes from its simplicity: in addition to preparing the input data, the programmer needs only to implement the mapper, the reducer, and optionally, the combiner and the partitioner. All other aspects of execution are handled transparently by the execution framework, on clusters ranging from a single node to a few thousand nodes, over datasets ranging from gigabytes to petabytes. However, this also means that any conceivable algorithm that a programmer wishes to develop must be expressed in terms of a small number of rigidly defined components that must fit together in very specific ways. It may not appear obvious how a multitude of algorithms can be recast into this programming model. The purpose of this chapter is to provide, primarily through examples, a guide to MapReduce algorithm design. These examples illustrate what can be thought of as "design patterns" for MapReduce, which instantiate arrangements of components and specific techniques designed to handle frequently encountered situations across a variety of problem domains. Two of these design patterns are used in the scalable inverted indexing algorithm we'll present later in Chapter 4; concepts presented here will show up again in Chapter 5 (graph processing) and Chapter 6 (expectation-maximization algorithms).

Synchronization is perhaps the most tricky aspect of designing MapReduce algorithms (or for that matter, parallel and distributed algorithms in general). Other than embarrassingly-parallel problems, processes running on separate nodes in a cluster must, at some point in time, come together, for example, to distribute partial results from nodes that produced them to the nodes that will consume them. Within a single MapReduce job, there is only one opportunity for cluster-wide synchronization: during the shuffle and sort stage, where intermediate key-value pairs are copied from the mappers to the reducers and grouped by key. Beyond that, mappers and reducers run in isolation without any mechanisms for direct communication. Furthermore, the programmer has little control over many aspects of execution, for example:

• Where a mapper or reducer runs (i.e., on which node in the cluster).
• When a mapper or reducer begins or finishes.
• Which input key-value pairs are processed by a specific mapper.
• Which intermediate key-value pairs are processed by a specific reducer.

Nevertheless, the programmer does have a number of techniques for controlling execution and managing the flow of data in MapReduce. In summary, they are:

1. The ability to construct complex data structures as keys and values to store and communicate partial results.
2. The ability to execute user-specified initialization code at the beginning of a map or reduce task, and the ability to execute user-specified termination code at the end of a map or reduce task.
3. The ability to preserve state in both mappers and reducers across multiple input or intermediate keys.
4. The ability to control the sort order of intermediate keys, and therefore the order in which a reducer will encounter particular keys.
5. The ability to control the partitioning of the key space, and therefore the set of keys that will be encountered by a particular reducer.

It is important to realize that many algorithms cannot be easily expressed as a single MapReduce job. One must often decompose complex algorithms into a sequence of jobs, which requires orchestrating data so that the output of one job becomes the input to the next. Many algorithms are iterative in nature, requiring repeated execution until some convergence criterion is met; graph algorithms in Chapter 5 and expectation-maximization algorithms in Chapter 6 behave in exactly this way. Often, the convergence check itself cannot be easily expressed in MapReduce. The standard solution is an external (non-MapReduce) program that serves as a "driver" to coordinate MapReduce iterations.

This chapter explains how various techniques to control code execution and data flow can be applied to design algorithms in MapReduce. The focus is both on scalability (ensuring that there are no inherent bottlenecks as algorithms are applied to increasingly larger datasets) and on efficiency (ensuring that algorithms do not needlessly consume resources, thereby reducing the cost of parallelization). The gold standard, of course, is linear scalability: an algorithm running on twice the amount of data should take only twice as long. Similarly, an algorithm running on twice the number of nodes should only take half as long.

The chapter is organized as follows:

• Section 3.1 introduces the important concept of local aggregation in MapReduce and strategies for designing efficient algorithms that minimize the amount of partial results that need to be copied across the network. The proper use of combiners is discussed in detail, as well as the "in-mapper combining" design pattern.
• Section 3.2 uses the example of building word co-occurrence matrices on large text corpora to illustrate two common design patterns, which we dub "pairs" and "stripes". These two approaches are useful in a large class of problems that require keeping track of joint events across a large number of observations.

• Section 3.3 shows how co-occurrence counts can be converted into relative frequencies using a pattern known as "order inversion". The sequencing of computations in the reducer can be recast as a sorting problem, where pieces of intermediate data are sorted into exactly the order that is required to carry out a series of computations. Often, a reducer needs to compute an aggregate statistic on a set of elements before individual elements can be processed. Normally, this would require two passes over the data, but with the "order inversion" design pattern, the aggregate statistic can be computed in the reducer before the individual elements are encountered. This may seem counter-intuitive: how can we compute an aggregate statistic on a set of elements before encountering elements of that set? As it turns out, clever sorting of special key-value pairs enables exactly this.
• Section 3.4 provides a general solution to secondary sorting, which is the problem of sorting values associated with a key in the reduce phase. We call this technique "value-to-key conversion".
• Section 3.5 covers the topic of performing joins on relational datasets and presents three different approaches: reduce-side, map-side, and memory-backed joins.

3.1 LOCAL AGGREGATION

In the context of data-intensive distributed processing, the single most important aspect of synchronization is the exchange of intermediate results, from the processes that produced them to the processes that will ultimately consume them. In a cluster environment, with the exception of embarrassingly-parallel problems, this necessarily involves transferring data over the network. Furthermore, in Hadoop, intermediate results are written to local disk before being sent over the network. Since network and disk latencies are relatively expensive compared to other operations, reductions in the amount of intermediate data translate into increases in algorithmic efficiency. In MapReduce, local aggregation of intermediate results is one of the keys to efficient algorithms. Through use of the combiner and by taking advantage of the ability to preserve state across multiple inputs, it is often possible to substantially reduce both the number and size of key-value pairs that need to be shuffled from the mappers to the reducers.

3.1.1 COMBINERS AND IN-MAPPER COMBINING

We illustrate various techniques for local aggregation using the simple word count example presented in Section 2.2. For convenience, Figure 3.1 repeats the pseudo-code of the basic algorithm, which is quite simple: the mapper emits an intermediate key-value pair for each term observed, with the term itself as the key and a value of one; reducers sum up the partial counts to arrive at the final count.

The first technique for local aggregation is the combiner, already discussed in Section 2.4. Combiners provide a general mechanism within the MapReduce framework to reduce the amount of intermediate data generated by the mappers. Recall that they can be understood as "mini-reducers" that process the output of mappers. In this example, the combiners aggregate term counts across the documents processed by each map task.
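The effect of a combiner on the basic word count algorithm can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: it is plain Python rather than Hadoop code, and the helper names (`map_word_count`, `reduce_word_count`, `group_by_key`, `run_job`) and the toy input splits are hypothetical, introduced purely for illustration. The reducer doubles as the combiner, which is valid here because summation is commutative and associative.

```python
from collections import defaultdict

def map_word_count(docid, doc):
    """Mapper: emit (term, 1) for every term occurrence in the document."""
    for term in doc.split():
        yield term, 1

def reduce_word_count(term, counts):
    """Reducer (reused unmodified as the combiner): sum the partial counts."""
    yield term, sum(counts)

def group_by_key(pairs):
    """Group values by key; a stand-in for the shuffle and sort stage."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def run_job(splits, use_combiner):
    """Run word count; each split is a list of (docid, doc) handled by one map task."""
    shuffled = []  # intermediate pairs that would be copied across the network
    for split in splits:
        emitted = [kv for docid, doc in split
                   for kv in map_word_count(docid, doc)]
        if use_combiner:
            # Aggregate locally, per map task, before anything is shuffled.
            emitted = [kv for term, counts in group_by_key(emitted)
                       for kv in reduce_word_count(term, counts)]
        shuffled.extend(emitted)
    result = dict(kv for term, counts in group_by_key(shuffled)
                  for kv in reduce_word_count(term, counts))
    return result, len(shuffled)

# Two hypothetical input splits, i.e., two map tasks.
splits = [[(1, "a dog is a dog"), (2, "a cat")], [(3, "the dog")]]
counts_plain, shuffled_plain = run_job(splits, use_combiner=False)
counts_combined, shuffled_combined = run_job(splits, use_combiner=True)
```

On this toy input both runs yield the same final counts, but the run with the combiner shuffles one pair per unique term per map task instead of one pair per term occurrence.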

This aggregation results in a reduction in the number of intermediate key-value pairs that need to be shuffled across the network: from the order of the total number of terms in the collection to the order of the number of unique terms in the collection.¹

    class Mapper
        method Map(docid a, doc d)
            for all term t ∈ doc d do
                Emit(term t, count 1)

    class Reducer
        method Reduce(term t, counts [c1, c2, ...])
            sum ← 0
            for all count c ∈ counts [c1, c2, ...] do
                sum ← sum + c
            Emit(term t, count sum)

Figure 3.1: Pseudo-code for the basic word count algorithm in MapReduce (repeated from Figure 2.3).

An improvement on the basic algorithm is shown in Figure 3.2 (the mapper is modified but the reducer remains the same as in Figure 3.1 and therefore is not repeated). An associative array (i.e., Map in Java) is introduced inside the mapper to tally up term counts within a single document: instead of emitting a key-value pair for each term in the document, this version emits a key-value pair for each unique term in the document. Given that some words appear frequently within a document (for example, a document about dogs is likely to have many occurrences of the word "dog"), this can yield substantial savings in the number of intermediate key-value pairs emitted, especially for long documents.

This basic idea can be taken one step further, as illustrated in the variant of the word count algorithm in Figure 3.3 (once again, only the mapper is modified). The workings of this algorithm critically depend on the details of how map and reduce tasks in Hadoop are executed, discussed in Section 2.6. Recall that a (Java) mapper object is created for each map task, which is responsible for processing a block of input key-value pairs. Prior to processing any input key-value pairs, the mapper's Initialize method is called, which is an API hook for user-specified code. In this case, we initialize an associative array for holding term counts. Since it is possible to preserve state across multiple calls of the Map method (for each input key-value pair), we can continue to accumulate partial term counts in the associative array across multiple documents, and emit key-value pairs only when the mapper has processed all documents.

¹ More precisely, if the combiners take advantage of all opportunities for local aggregation, the algorithm would generate at most m × V intermediate key-value pairs, where m is the number of mappers and V is the vocabulary size (number of unique terms in the collection), since every term could have been observed in every mapper. However, there are two additional factors to consider. Due to the Zipfian nature of term distributions, most terms will not be observed by most mappers (for example, terms that occur only once will by definition only be observed by one mapper). On the other hand, combiners in Hadoop are treated as optional optimizations, so there is no guarantee that the execution framework will take advantage of all opportunities for partial aggregation.

Emission of intermediate data is thus deferred until the Close method in the pseudo-code. Recall that this API hook provides an opportunity to execute user-specified code after the Map method has been applied to all input key-value pairs of the input data split to which the map task was assigned.

    class Mapper
        method Map(docid a, doc d)
            H ← new AssociativeArray
            for all term t ∈ doc d do
                H{t} ← H{t} + 1            ▷ Tally counts for entire document
            for all term t ∈ H do
                Emit(term t, count H{t})

Figure 3.2: Pseudo-code for the improved MapReduce word count algorithm that uses an associative array to aggregate term counts on a per-document basis. Reducer is the same as in Figure 3.1.

    class Mapper
        method Initialize
            H ← new AssociativeArray
        method Map(docid a, doc d)
            for all term t ∈ doc d do
                H{t} ← H{t} + 1            ▷ Tally counts across documents
        method Close
            for all term t ∈ H do
                Emit(term t, count H{t})

Figure 3.3: Pseudo-code for the improved MapReduce word count algorithm that demonstrates the "in-mapper combining" design pattern. Reducer is the same as in Figure 3.1.

With this technique, we are in essence incorporating combiner functionality directly inside the mapper. There is no need to run a separate combiner, since all opportunities for local aggregation are already exploited.² This is a sufficiently common design pattern in MapReduce that it's worth giving it a name, "in-mapper combining", so that we can refer to the pattern more conveniently throughout the book. We'll see later on how this pattern can be applied to a variety of problems. There are two main advantages to using this design pattern:

First, it provides control over when local aggregation occurs and how it exactly takes place. In contrast, the semantics of the combiner is underspecified in MapReduce. For example, Hadoop makes no guarantees on how many times the combiner is applied, or that it is even applied at all.

² Leaving aside the minor complication that in Hadoop, combiners can be run in the reduce phase also (when merging intermediate key-value pairs from different map tasks). However, in practice, it makes almost no difference either way.
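The in-mapper combining pattern of Figure 3.3 can be sketched as a small Python class that mirrors the Initialize, Map, and Close hooks. This is a hedged illustration rather than Hadoop code: the class name `InMapperCombiningMapper`, the `emit` callback, and the `run_map_task` driver are hypothetical stand-ins for what the execution framework would provide.

```python
from collections import defaultdict

class InMapperCombiningMapper:
    """Word count mapper that aggregates across all documents in its input split."""

    def __init__(self, emit):
        self.emit = emit  # callback standing in for the framework's Emit

    def initialize(self):
        # Called once, before any input key-value pair is processed.
        self.counts = defaultdict(int)

    def map(self, docid, doc):
        # State persists across calls, so counts accumulate over many documents.
        for term in doc.split():
            self.counts[term] += 1

    def close(self):
        # Called once, after the whole input split has been processed.
        for term, count in self.counts.items():
            self.emit(term, count)

def run_map_task(split):
    """Hypothetical driver playing the role of the execution framework."""
    emitted = []
    mapper = InMapperCombiningMapper(lambda key, value: emitted.append((key, value)))
    mapper.initialize()
    for docid, doc in split:
        mapper.map(docid, doc)
    mapper.close()
    return emitted

emitted = run_map_task([(1, "a dog is a dog"), (2, "a cat")])
```

Nothing is emitted from `map` itself; the single pass in `close` produces exactly one pair per unique term seen by the map task.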

The combiner is provided as a semantics-preserving optimization to the execution framework, which has the option of using it, perhaps multiple times, or not at all (or even in the reduce phase). In some cases (although not in this particular example), such nondeterminism is unacceptable, which is exactly why programmers often choose to perform their own local aggregation in the mappers.

Second, in-mapper combining will typically be more efficient than using actual combiners. One reason for this is the additional overhead associated with actually materializing the key-value pairs. Combiners reduce the amount of intermediate data that is shuffled across the network, but don't actually reduce the number of key-value pairs that are emitted by the mappers in the first place. With the algorithm in Figure 3.2, intermediate key-value pairs are still generated on a per-document basis, only to be "compacted" by the combiners. This process involves unnecessary object creation and destruction (garbage collection takes time), and furthermore, object serialization and deserialization (when intermediate key-value pairs fill the in-memory buffer holding map outputs and need to be temporarily spilled to disk). In contrast, with in-mapper combining, the mappers will generate only those key-value pairs that need to be shuffled across the network to the reducers.

There are, however, drawbacks to the in-mapper combining pattern. First, it breaks the functional programming underpinnings of MapReduce, since state is being preserved across multiple input key-value pairs. Ultimately, this isn't a big deal, since pragmatic concerns for efficiency often trump theoretical "purity", but there are practical consequences as well. Preserving state across multiple input instances means that algorithmic behavior may depend on the order in which input key-value pairs are encountered. This creates the potential for ordering-dependent bugs, which are difficult to debug on large datasets in the general case (although the correctness of in-mapper combining for word count is easy to demonstrate). Second, there is a fundamental scalability bottleneck associated with the in-mapper combining pattern. It critically depends on having sufficient memory to store intermediate results until the mapper has completely processed all key-value pairs in an input split. In the word count example, the memory footprint is bounded by the vocabulary size, since it is theoretically possible that a mapper encounters every term in the collection. Heaps' Law, a well-known result in information retrieval, accurately models the growth of vocabulary size as a function of the collection size; the somewhat surprising fact is that the vocabulary size never stops growing.³ Therefore, the algorithm in Figure 3.3 will scale only up to a point, beyond which the associative array holding the partial term counts will no longer fit in memory.⁴

One common solution to limiting memory usage when using the in-mapper combining technique is to "block" input key-value pairs and "flush" in-memory data structures periodically. The idea is simple: instead of emitting intermediate data only after every key-value pair has been processed, emit partial results after processing every n key-value pairs. This is straightforwardly implemented with a counter variable that keeps track of the number of input key-value pairs that have been processed.

³ In more detail, Heaps' Law relates the vocabulary size V to the collection size as follows: V = kT^b, where T is the number of tokens in the collection. Typical values of the parameters k and b are: 30 ≤ k ≤ 100 and b ≈ 0.5 ([101], p. 81).

⁴ A few more details: note that what matters is that the partial term counts encountered within a particular input split fit into memory. However, as collection sizes increase, one will often want to increase the input split size to limit the growth of the number of map tasks (in order to reduce the number of distinct copy operations necessary to shuffle intermediate data over the network).
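The block-and-flush idea, where partial results are emitted after every n input key-value pairs, can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `BlockedInMapperCombiner`, the `block_size` parameter, and the `emit` callback are hypothetical, and a real Hadoop mapper would use the framework's own hooks instead.

```python
from collections import defaultdict

class BlockedInMapperCombiner:
    """In-mapper combining for word count with a flush every `block_size` inputs."""

    def __init__(self, emit, block_size):
        self.emit = emit              # stand-in for the framework's Emit
        self.block_size = block_size  # n: number of inputs processed between flushes

    def initialize(self):
        self.counts = defaultdict(int)
        self.seen = 0  # counter of input key-value pairs processed so far

    def map(self, docid, doc):
        for term in doc.split():
            self.counts[term] += 1
        self.seen += 1
        if self.seen % self.block_size == 0:
            self.flush()

    def flush(self):
        # Emit the partial counts and release the memory they occupied.
        for term, count in self.counts.items():
            self.emit(term, count)
        self.counts.clear()

    def close(self):
        # Emit whatever remains from the last, possibly incomplete, block.
        self.flush()

emitted = []
mapper = BlockedInMapperCombiner(lambda k, v: emitted.append((k, v)), block_size=2)
mapper.initialize()
for docid, doc in [(1, "a b a"), (2, "a c"), (3, "b b")]:
    mapper.map(docid, doc)
mapper.close()

# Sum the partial counts the way a reducer would.
totals = defaultdict(int)
for term, count in emitted:
    totals[term] += count
```

The final totals are unaffected by the block size; a smaller block bounds memory more tightly but emits more partial pairs for the same term (here "b" is emitted twice).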

As an alternative, the mapper could keep track of its own memory footprint and flush intermediate key-value pairs once memory usage has crossed a certain threshold. In both approaches, either the block size or the memory usage threshold needs to be determined empirically: with too large a value, the mapper may run out of memory, but with too small a value, opportunities for local aggregation may be lost. Furthermore, in Hadoop physical memory is split between multiple tasks that may be running on a node concurrently; these tasks are all competing for finite resources, but since the tasks are not aware of each other, it is difficult to coordinate resource consumption effectively. In practice, however, one often encounters diminishing returns in performance gains with increasing buffer sizes, such that it is not worth the effort to search for an optimal buffer size (personal communication, Jeff Dean).

In MapReduce algorithms, the extent to which efficiency can be increased through local aggregation depends on the size of the intermediate key space, the distribution of keys themselves, and the number of key-value pairs that are emitted by each individual map task. Opportunities for aggregation, after all, come from having multiple values associated with the same key (whether one uses combiners or employs the in-mapper combining pattern). In the word count example, local aggregation is effective because many words are encountered multiple times within a map task. Local aggregation is also an effective technique for dealing with reduce stragglers (see Section 2.3) that result from a highly skewed (e.g., Zipfian) distribution of values associated with intermediate keys. In our word count example, we do not filter frequently occurring words: therefore, without local aggregation, the reducer that's responsible for computing the count of 'the' will have a lot more work to do than the typical reducer, and therefore will likely be a straggler. With local aggregation (either combiners or in-mapper combining), we substantially reduce the number of values associated with frequently occurring terms, which alleviates the reduce straggler problem.

3.1.2 ALGORITHMIC CORRECTNESS WITH LOCAL AGGREGATION

Although use of combiners can yield dramatic reductions in algorithm running time, care must be taken in applying them. Since combiners in Hadoop are viewed as optional optimizations, the correctness of the algorithm cannot depend on computations performed by the combiner or depend on them even being run at all. In any MapReduce program, the reducer input key-value type must match the mapper output key-value type: this implies that the combiner input and output key-value types must match the mapper output key-value type (which is the same as the reducer input key-value type). In cases where the reduce computation is both commutative and associative, the reducer can also be used (unmodified) as the combiner (as is the case with the word count example). In the general case, however, combiners and reducers are not interchangeable.

Consider a simple example: we have a large dataset where input keys are strings and input values are integers, and we wish to compute the mean of all integers associated with the same key (rounded to the nearest integer). A real-world example might be a large user log from a popular website, where keys represent user ids and values represent some measure of activity such as elapsed time for a particular session. The task would correspond to computing the mean session length on a per-user basis, which would be useful for understanding user demographics.

Figure 3.4 shows the pseudo-code of a simple algorithm for accomplishing this task that does not involve combiners. We use an identity mapper, which simply passes all input key-value pairs to the reducers (appropriately grouped and sorted). The reducer keeps track of the running sum and the number of integers encountered. This information is used to compute the mean once all values are processed. The mean is then emitted as the output value in the reducer (with the input string as the key).

    class Mapper
        method Map(string t, integer r)
            Emit(string t, integer r)

    class Reducer
        method Reduce(string t, integers [r1, r2, ...])
            sum ← 0
            cnt ← 0
            for all integer r ∈ integers [r1, r2, ...] do
                sum ← sum + r
                cnt ← cnt + 1
            ravg ← sum/cnt
            Emit(string t, integer ravg)

Figure 3.4: Pseudo-code for the basic MapReduce algorithm that computes the mean of values associated with the same key.

This algorithm will indeed work, but suffers from the same drawbacks as the basic word count algorithm in Figure 3.1: it requires shuffling all key-value pairs from mappers to reducers across the network, which is highly inefficient. Unlike in the word count example, the reducer cannot be used as a combiner in this case. Consider what would happen if we did: the combiner would compute the mean of an arbitrary subset of values associated with the same key, and the reducer would compute the mean of those values. As a concrete example, we know that:

    Mean(1, 2, 3, 4, 5) ≠ Mean(Mean(1, 2), Mean(3, 4, 5))

In general, the mean of means of arbitrary subsets of a set of numbers is not the same as the mean of the set of numbers. Therefore, this approach would not produce the correct result.⁵

So how might we properly take advantage of combiners? An attempt is shown in Figure 3.5. The mapper remains the same, but we have added a combiner that partially aggregates results by computing the numeric components necessary to arrive at the mean. The combiner receives each string and the associated list of integer values, from which it computes the sum of those values and the number of integers encountered (i.e., the count).

⁵ There is, however, one special case in which using reducers as combiners would produce the correct result: if each combiner computed the mean of equal-size subsets of the values. However, since such fine-grained control over the combiners is impossible in MapReduce, such a scenario is highly unlikely.
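The claim that a mean of means generally differs from the overall mean is easy to check numerically. The short Python sketch below uses only the standard library's `statistics.mean`; the variable names are chosen for illustration. It also checks the special case of equal-size subsets, where the two quantities do agree.

```python
from statistics import mean

# Overall mean of the five values: (1 + 2 + 3 + 4 + 5) / 5
true_mean = mean([1, 2, 3, 4, 5])

# Mean of the subset means: Mean(1, 2) = 1.5 and Mean(3, 4, 5) = 4,
# so the result is (1.5 + 4) / 2, which weights the smaller subset too heavily.
mean_of_means = mean([mean([1, 2]), mean([3, 4, 5])])

# Special case: equal-size subsets give every value the same weight.
equal_split = mean([mean([1, 2]), mean([3, 4])])
equal_true = mean([1, 2, 3, 4])
```

The discrepancy arises because each subset mean discards the subset's size, which is exactly the information needed to weight it correctly.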

The sum and count are packaged into a pair, and emitted as the output of the combiner, with the same string as the key. In the reducer, pairs of partial sums and counts can be aggregated to arrive at the mean. Up until now, all keys and values in our algorithms have been primitives (string, integers, etc.). However, there are no prohibitions in MapReduce for more complex types,⁶ and, in fact, this represents a key technique in MapReduce algorithm design that we introduced at the beginning of this chapter. We will frequently encounter complex keys and values throughout the rest of this book.

    class Mapper
        method Map(string t, integer r)
            Emit(string t, integer r)

    class Combiner
        method Combine(string t, integers [r1, r2, ...])
            sum ← 0
            cnt ← 0
            for all integer r ∈ integers [r1, r2, ...] do
                sum ← sum + r
                cnt ← cnt + 1
            Emit(string t, pair (sum, cnt))            ▷ Separate sum and count

    class Reducer
        method Reduce(string t, pairs [(s1, c1), (s2, c2) ...])
            sum ← 0
            cnt ← 0
            for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2) ...] do
                sum ← sum + s
                cnt ← cnt + c
            ravg ← sum/cnt
            Emit(string t, integer ravg)

Figure 3.5: Pseudo-code for an incorrect first attempt at introducing combiners to compute the mean of values associated with each key. The mismatch between combiner input and output key-value types violates the MapReduce programming model.

Unfortunately, this algorithm will not work. Recall that combiners must have the same input and output key-value type, which also must be the same as the mapper output type and the reducer input type. This is clearly not the case. To understand why this restriction is necessary in the programming model, remember that combiners are optimizations that cannot change the correctness of the algorithm. So let us remove the combiner and see what happens.

⁶ In Hadoop, either custom types or types defined using a library such as Protocol Buffers, Thrift, or Avro.

The output value type of the mapper is integer, so the reducer would receive a list of integers as values. But the reducer actually expects a list of pairs! The correctness of the algorithm is contingent on the combiner running on the output of the mappers, and more specifically, that the combiner is run exactly once. Recall from our previous discussion that Hadoop makes no guarantees on how many times combiners are called; it could be zero, one, or multiple times. This violates the MapReduce programming model.

Another stab at the algorithm is shown in Figure 3.6, and this time, the algorithm is correct. In the mapper we emit as the value a pair consisting of the integer and one; this corresponds to a partial count over one instance. The combiner separately aggregates the partial sums and the partial counts (as before), and emits pairs with updated sums and counts. The reducer is similar to the combiner, except that the mean is computed at the end. In essence, this algorithm transforms a non-associative operation (mean of numbers) into an associative operation (element-wise sum of a pair of numbers, with an additional division at the very end).

    class Mapper
        method Map(string t, integer r)
            Emit(string t, pair (r, 1))

    class Combiner
        method Combine(string t, pairs [(s1, c1), (s2, c2) ...])
            sum ← 0
            cnt ← 0
            for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2) ...] do
                sum ← sum + s
                cnt ← cnt + c
            Emit(string t, pair (sum, cnt))

    class Reducer
        method Reduce(string t, pairs [(s1, c1), (s2, c2) ...])
            sum ← 0
            cnt ← 0
            for all pair (s, c) ∈ pairs [(s1, c1), (s2, c2) ...] do
                sum ← sum + s
                cnt ← cnt + c
            ravg ← sum/cnt
            Emit(string t, integer ravg)

Figure 3.6: Pseudo-code for a MapReduce algorithm that computes the mean of values associated with each key. This algorithm correctly takes advantage of combiners.

Let us verify the correctness of this algorithm by repeating the previous exercise: What would happen if no combiners were run? With no combiners, the mappers would send pairs (as values) directly to the reducers.
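The correctness argument for Figure 3.6 can be exercised with a small simulation in which the combiner runs zero, one, or several times over arbitrary chunks of the intermediate data. This is a sketch under stated assumptions: the function names, the `chunk` parameter used to mimic arbitrary grouping, and the sample records are all hypothetical, and the mean is left as a float rather than rounded to an integer.

```python
from collections import defaultdict

def map_mean(key, value):
    """Mapper: emit the value paired with a count of one."""
    yield key, (value, 1)

def combine_mean(key, pairs):
    """Combiner: element-wise sum of (sum, count) pairs; input and output types match."""
    yield key, (sum(s for s, _ in pairs), sum(c for _, c in pairs))

def reduce_mean(key, pairs):
    """Reducer: same aggregation as the combiner, plus the final division."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield key, total / count

def group_by_key(pairs):
    """Group values by key; a stand-in for the shuffle and sort stage."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def run_job(records, combiner_rounds, chunk=2):
    """Apply the combiner `combiner_rounds` times to chunks of intermediate pairs."""
    intermediate = [kv for key, value in records for kv in map_mean(key, value)]
    for _ in range(combiner_rounds):
        combined = []
        for start in range(0, len(intermediate), chunk):
            part = intermediate[start:start + chunk]
            for key, pairs in group_by_key(part):
                combined.extend(combine_mean(key, pairs))
        intermediate = combined
    return dict(kv for key, pairs in group_by_key(intermediate)
                for kv in reduce_mean(key, pairs))

records = [("u1", 1), ("u1", 2), ("u1", 3), ("u1", 4), ("u1", 5),
           ("u2", 10), ("u2", 20)]
results = [run_job(records, rounds) for rounds in (0, 1, 3)]
```

All three runs produce the same means, because summing (sum, count) pairs is associative and commutative and the division happens only once, in the reducer.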

There would be as many intermediate pairs as there were input key-value pairs, and each of those would consist of an integer and one. The reducer would still arrive at the correct sum and count, and hence the mean would be correct. Now add in the combiners: the algorithm would remain correct, no matter how many times they run, since the combiners merely aggregate partial sums and counts to pass along to the reducers. Note that although the output key-value type of the combiner must be the same as the input key-value type of the reducer, the reducer can emit final key-value pairs of a different type.

Finally, in Figure 3.7, we present an even more efficient algorithm that exploits the in-mapper combining pattern. Inside the mapper, the partial sums and counts associated with each string are held in memory across input key-value pairs. Intermediate key-value pairs are emitted only after the entire input split has been processed; similar to before, the value is a pair consisting of the sum and count. The reducer is exactly the same as in Figure 3.6. Moving partial aggregation from the combiner directly into the mapper is subject to all the tradeoffs and caveats discussed earlier in this section, but in this case the memory footprint of the data structures for holding intermediate data is likely to be modest, making this variant algorithm an attractive option.

    class Mapper
        method Initialize
            S ← new AssociativeArray
            C ← new AssociativeArray
        method Map(string t, integer r)
            S{t} ← S{t} + r
            C{t} ← C{t} + 1
        method Close
            for all term t ∈ S do
                Emit(term t, pair (S{t}, C{t}))

Figure 3.7: Pseudo-code for a MapReduce algorithm that computes the mean of values associated with each key, illustrating the in-mapper combining design pattern. Only the mapper is shown here; the reducer is the same as in Figure 3.6.

3.2 PAIRS AND STRIPES

One common approach for synchronization in MapReduce is to construct complex keys and values in such a way that data necessary for a computation are naturally brought together by the execution framework. We first touched on this technique in the previous section, in the context of "packaging" partial sums and counts in a complex value (i.e., pair) that is passed from mapper to combiner to reducer. Building on previously published work [54, 94], this section introduces two common design patterns we have dubbed "pairs" and "stripes" that exemplify this strategy.

As a running example, we focus on the problem of building word co-occurrence matrices from large corpora, a common task in corpus linguistics and statistical natural language processing. Formally, the co-occurrence matrix of a corpus is a square n × n matrix where n is the number of unique words in the corpus (i.e., the vocabulary size). A cell m_ij contains the number of times word w_i co-occurs with word w_j within a specific context: a natural unit such as a sentence, paragraph, or a document, or a certain window of m words (where m is an application-dependent parameter). Note that the upper and lower triangles of the matrix are identical since co-occurrence is a symmetric relation, though in the general case relations between words need not be symmetric. For example, a co-occurrence matrix M where m_ij is the count of how many times word w_i was immediately succeeded by word w_j would usually not be symmetric.

This task is quite common in text processing and provides the starting point to many other algorithms, e.g., for computing statistics such as pointwise mutual information [38], for unsupervised sense clustering [136], and more generally, a large body of work in lexical semantics based on distributional profiles of words, dating back to Firth [55] and Harris [69] in the 1950s and 1960s. The task also has applications in information retrieval (e.g., automatic thesaurus construction [137] and stemming [157]), and other related fields such as text mining. More importantly, this problem represents a specific instance of the task of estimating distributions of discrete joint events from a large number of observations, a very common task in statistical natural language processing for which there are nice MapReduce solutions. Indeed, concepts presented here are also used in Chapter 6 when we discuss expectation-maximization algorithms.

Beyond text processing, problems in many application domains share similar characteristics. For example, a large retailer might analyze point-of-sale transaction records to identify correlated product purchases (e.g., customers who buy this tend to also buy that), which would assist in inventory management and product placement on store shelves. Similarly, an intelligence analyst might wish to identify associations between re-occurring financial transactions that are otherwise unrelated, which might provide a clue in thwarting terrorist activity. The algorithms discussed in this section could be adapted to tackle these related problems.

It is obvious that the space requirement for the word co-occurrence problem is O(n²), where n is the size of the vocabulary, which for real-world English corpora can be hundreds of thousands of words, or even billions of words in web-scale collections.7 The computation of the word co-occurrence matrix is quite simple if the entire matrix fits into memory; however, in the case where the matrix is too big to fit in memory, a naïve implementation on a single machine can be very slow as memory is paged to disk. Although compression techniques can increase the size of corpora for which word co-occurrence matrices can be constructed on a single machine, it is clear that there are inherent scalability limitations. We describe two MapReduce algorithms for this task that can scale to large corpora.

7 The size of the vocabulary depends on the definition of a “word” and techniques (if any) for corpus pre-processing. One common strategy is to replace all rare words (below a certain frequency) with a “special” token such as <UNK> (which stands for “unknown”) to model out-of-vocabulary words. Another technique involves replacing numeric digits with #, such that 1.32 and 1.19 both map to the same token (#.##).
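To make the definition of the co-occurrence matrix concrete, here is a minimal single-machine sketch in Python, i.e., the naive in-memory computation that the MapReduce algorithms are meant to replace. The function names and the choice of a symmetric window of `window` tokens on each side are assumptions made for illustration; the text leaves the exact definition of the context open.

```python
from collections import Counter


def neighbors(tokens, i, window):
    """Tokens within `window` positions of index i, excluding position i itself."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return tokens[lo:i] + tokens[i + 1:hi]


def cooccurrence_counts(tokens, window=2):
    """Sparse co-occurrence matrix: maps (w_i, w_j) to the count m_ij."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for u in neighbors(tokens, i, window):
            counts[(w, u)] += 1
    return counts


counts = cooccurrence_counts("a b a c".split(), window=1)
```

With a symmetric window the counts are symmetric, so `counts[("a", "b")]` equals `counts[("b", "a")]`. The whole `Counter` must fit in memory, which is exactly the limitation discussed above.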

Pseudo-code for the first algorithm, dubbed the “pairs” approach, is shown in Figure 3.8. As usual, document ids and the corresponding contents make up the input key-value pairs. The mapper processes each input document and emits intermediate key-value pairs with each co-occurring word pair as the key and the integer one (i.e., the count) as the value. This is straightforwardly accomplished by two nested loops: the outer loop iterates over all words (the left element in the pair), and the inner loop iterates over all neighbors of the first word (the right element in the pair). The neighbors of a word can either be defined in terms of a sliding window or some other contextual unit such as a sentence. The MapReduce execution framework guarantees that all values associated with the same key are brought together in the reducer. Thus, in this case the reducer simply sums up all the values associated with the same co-occurring word pair to arrive at the absolute count of the joint event in the corpus, which is then emitted as the final key-value pair. Each pair corresponds to a cell in the word co-occurrence matrix. This algorithm illustrates the use of complex keys in order to coordinate distributed computations.

An alternative approach, dubbed the “stripes” approach, is presented in Figure 3.9. Like the pairs approach, co-occurring word pairs are generated by two nested loops. However, the major difference is that instead of emitting intermediate key-value pairs for each co-occurring word pair, co-occurrence information is first stored in an associative array, denoted H. The mapper emits key-value pairs with words as keys and corresponding associative arrays as values, where each associative array encodes the co-occurrence counts of the neighbors of a particular word (i.e., its context). The MapReduce execution framework guarantees that all associative arrays with the same key will be brought together in the reduce phase of processing. The reducer performs an element-wise sum of all associative arrays with the same key, accumulating counts that correspond to the same cell in the co-occurrence matrix. The final associative array is emitted with the same word as the key. In contrast to the pairs approach, each final key-value pair encodes a row in the co-occurrence matrix.

It is immediately obvious that the pairs algorithm generates an immense number of key-value pairs compared to the stripes approach. The stripes representation is much more compact, since with pairs the left element is repeated for every co-occurring word pair. The stripes approach also generates fewer and shorter intermediate keys, and therefore the execution framework has less sorting to perform. However, values in the stripes approach are more complex, and come with more serialization and deserialization overhead than with the pairs approach.

Both algorithms can benefit from the use of combiners, since the respective operations in their reducers (addition and element-wise sum of associative arrays) are both commutative and associative. However, combiners with the stripes approach have more opportunities to perform local aggregation because the key space is the vocabulary: associative arrays can be merged whenever a word is encountered multiple times by a mapper. In contrast, the key space in the pairs approach is the cross of the vocabulary with itself, which is far larger: counts can be aggregated only when the same co-occurring word pair is observed multiple times by an individual mapper (which is less likely than observing multiple occurrences of a word, as in the stripes case).

    class Mapper
        method Map(docid a, doc d)
            for all term w ∈ doc d do
                for all term u ∈ Neighbors(w) do
                    Emit(pair (w, u), count 1)        ▷ Emit count for each co-occurrence

    class Reducer
        method Reduce(pair p, counts [c1, c2, ...])
            s ← 0
            for all count c ∈ counts [c1, c2, ...] do
                s ← s + c                             ▷ Sum co-occurrence counts
            Emit(pair p, count s)

Figure 3.8: Pseudo-code for the “pairs” approach for computing word co-occurrence matrices from large corpora.

    class Mapper
        method Map(docid a, doc d)
            for all term w ∈ doc d do
                H ← new AssociativeArray
                for all term u ∈ Neighbors(w) do
                    H{u} ← H{u} + 1                   ▷ Tally words co-occurring with w
                Emit(Term w, Stripe H)

    class Reducer
        method Reduce(term w, stripes [H1, H2, H3, ...])
            Hf ← new AssociativeArray
            for all stripe H ∈ stripes [H1, H2, H3, ...] do
                Sum(Hf, H)                            ▷ Element-wise sum
            Emit(term w, stripe Hf)

Figure 3.9: Pseudo-code for the “stripes” approach for computing word co-occurrence matrices from large corpora.
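The two pseudo-code figures can be exercised side by side with a small single-process simulation in Python. This is a sketch rather than Hadoop code: the `run` helper stands in for the shuffle and sort phase, documents are plain whitespace-separated strings, and `neighbors` assumes a symmetric window of one token on each side. All names are hypothetical.

```python
from collections import Counter, defaultdict


def neighbors(tokens, i, window=1):
    """Tokens within `window` positions of index i (assumed context definition)."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return tokens[lo:i] + tokens[i + 1:hi]


def pairs_map(doc):
    tokens = doc.split()
    for i, w in enumerate(tokens):
        for u in neighbors(tokens, i):
            yield (w, u), 1                  # one pair per co-occurrence


def pairs_reduce(pair, counts):
    return pair, sum(counts)                 # sum co-occurrence counts


def stripes_map(doc):
    tokens = doc.split()
    for i, w in enumerate(tokens):
        yield w, Counter(neighbors(tokens, i))   # tally words co-occurring with w


def stripes_reduce(word, stripes):
    total = Counter()
    for stripe in stripes:
        total.update(stripe)                 # element-wise sum of associative arrays
    return word, total


def run(docs, mapper, reducer):
    """Group mapper output by key, then apply the reducer to each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


docs = ["a b a", "b a"]
pair_counts = run(docs, pairs_map, pairs_reduce)
stripe_counts = run(docs, stripes_map, stripes_reduce)
```

Each entry of `pair_counts` is one cell of the co-occurrence matrix, while each entry of `stripe_counts` is one row, so flattening the stripes reproduces the pairs output.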

For both algorithms, the in-mapper combining optimization discussed in the previous section can also be applied; the modification is sufficiently straightforward that we leave the implementation as an exercise for the reader. However, the above caveats remain: there will be far fewer opportunities for partial aggregation in the pairs approach due to the sparsity of the intermediate key space. The sparsity of the key space also limits the effectiveness of in-memory combining, since the mapper may run out of memory to store partial counts before all documents are processed, necessitating some mechanism to periodically emit key-value pairs (which further limits opportunities to perform partial aggregation). Similarly, for the stripes approach, memory management will also be more complex than in the simple word count example. For common terms, the associative array may grow to be quite large, necessitating some mechanism to periodically flush in-memory structures.

It is important to consider potential scalability bottlenecks of either algorithm. The stripes approach makes the assumption that, at any point in time, each associative array is small enough to fit into memory; otherwise, memory paging will significantly impact performance. The size of the associative array is bounded by the vocabulary size, which is itself unbounded with respect to corpus size (recall the previous discussion of Heaps' Law). Therefore, as the sizes of corpora increase, this will become an increasingly pressing issue, perhaps not for gigabyte-sized corpora, but certainly for terabyte-sized and petabyte-sized corpora that will be commonplace tomorrow. The pairs approach, on the other hand, does not suffer from this limitation, since it does not need to hold intermediate data in memory.

Given this discussion, which approach is faster? Here, we present previously published results [94] that empirically answered this question. We implemented both algorithms in Hadoop and applied them to a corpus of 2.27 million documents from the Associated Press Worldstream (APW) totaling 5.7 GB.8 Prior to working with Hadoop, the corpus was first preprocessed as follows: All XML markup was removed, followed by tokenization and stopword removal using standard tools from the Lucene search engine. All tokens were then replaced with unique integers for a more efficient encoding. Figure 3.10 compares the running time of the pairs and stripes approach on different fractions of the corpus, with a co-occurrence window size of two. These experiments were performed on a Hadoop cluster with 19 slave nodes, each with 2 single-core processors and 2 disks.

8 This was a subset of the English Gigaword corpus (version 3) distributed by the Linguistic Data Consortium (LDC catalog number LDC2007T07).

Results demonstrate that the stripes approach is much faster than the pairs approach: 666 seconds (∼11 minutes) compared to 3758 seconds (∼62 minutes) for the entire corpus (improvement by a factor of 5.7). The mappers in the pairs approach generated 2.6 billion intermediate key-value pairs totaling 31.2 GB. After the combiners, this was reduced to 1.1 billion key-value pairs, which quantifies the amount of intermediate data transferred across the network. In the end, the reducers emitted a total of 142 million final key-value pairs (the number of non-zero cells in the co-occurrence matrix). On the other hand, the mappers in the stripes approach generated 653 million intermediate key-value pairs totaling 48.1 GB. After the combiners, only 28.8 million key-value pairs remained. The reducers emitted a total of 1.69 million final key-value pairs (the number of rows in the co-occurrence matrix). As expected, the stripes approach provided more opportunities for combiners to aggregate intermediate results, thus greatly reducing network traffic in the shuffle and sort phase. Figure 3.10 also shows that both algorithms exhibit highly desirable scaling characteristics: running time is linear in the amount of input data. This is confirmed by a linear regression applied to the running time data, which yields an R² value close to one.

An additional series of experiments explored the scalability of the stripes approach along another dimension: the size of the cluster. These experiments were made possible by Amazon's EC2 service, which allows users to rapidly provision clusters of varying sizes for limited durations (for more information, refer back to our discussion of utility computing in Section 1.1). Virtualized computational units in EC2 are called instances, and the user is charged only for the instance-hours consumed. Figure 3.11 (left) shows the running time of the stripes algorithm (on the same corpus, with the same setup as before), on varying cluster sizes, from 20 slave “small” instances all the way up to 80 slave “small” instances (along the x-axis). Running times are shown with solid squares. Figure 3.11 (right) recasts the same results to illustrate scaling characteristics. The circles plot the relative size and speedup of the EC2 experiments, with respect to the 20-instance cluster. These results show highly desirable linear scaling characteristics (i.e., doubling the cluster size makes the job twice as fast). This is confirmed by a linear regression with an R² value close to one.

Viewed abstractly, the pairs and stripes algorithms represent two different approaches to counting co-occurring events from a large number of observations. This general description captures the gist of many algorithms in fields as diverse as text processing, data mining, and bioinformatics. For this reason, these two design patterns are broadly useful and frequently observed in a variety of applications.

To conclude, it is worth noting that the pairs and stripes approaches represent endpoints along a continuum of possibilities. The pairs approach individually records each co-occurring event, while the stripes approach records all co-occurring events with respect to a conditioning event. A middle ground might be to record a subset of the co-occurring events with respect to a conditioning event. We might divide up the entire vocabulary into b buckets (e.g., via hashing), so that words co-occurring with w_i would be divided into b smaller “sub-stripes”, associated with b separate keys, (w_i, 1), (w_i, 2), ..., (w_i, b). This would be a reasonable solution to the memory limitations of the stripes approach, since each of the sub-stripes would be smaller. In the case of b = |V|, where |V| is the vocabulary size, this is equivalent to the pairs approach. In the case of b = 1, this is equivalent to the standard stripes approach.

3.3 COMPUTING RELATIVE FREQUENCIES

Let us build on the pairs and stripes algorithms presented in the previous section and continue with our running example of constructing the word co-occurrence matrix M for a large corpus. Recall that in this large square n × n matrix, where n = |V| (the vocabulary size), cell m_ij contains the number of times word w_i co-occurs with word w_j within a specific context. The drawback of absolute counts is that they do not take into account the fact that some words appear more frequently than others. Word

[Plot: running time (seconds) against percentage of the APW corpus for the “stripes” approach and the “pairs” approach, with linear fits of R² = 0.992 and R² = 0.999.]

Figure 3.10: Running time of the “pairs” and “stripes” algorithms for computing word co-occurrence matrices on different fractions of the APW corpus. These experiments were performed on a Hadoop cluster with 19 slaves, each with 2 single-core processors and 2 disks.

[Plots: left, running time (seconds) against size of EC2 cluster (number of slave instances); right, relative speedup against relative size of EC2 cluster, with a linear fit of R² = 0.997.]

Figure 3.11: Running time of the stripes algorithm on the APW corpus with Hadoop clusters of different sizes from EC2 (left). Scaling characteristics (relative speedup) in terms of increasing Hadoop cluster size (right).

w_i may co-occur frequently with w_j simply because one of the words is very common. A simple remedy is to convert absolute counts into relative frequencies, f(w_j | w_i). That is, what proportion of the time does w_j appear in the context of w_i? This can be computed using the following equation:

    f(w_j | w_i) = \frac{N(w_i, w_j)}{\sum_{w'} N(w_i, w')}        (3.1)

Here, N(·, ·) indicates the number of times a particular co-occurring word pair is observed in the corpus. We need the count of the joint event (word co-occurrence), divided by what is known as the marginal (the sum of the counts of the conditioning variable co-occurring with anything else).

Computing relative frequencies with the stripes approach is straightforward. In the reducer, counts of all words that co-occur with the conditioning variable (w_i in the above example) are available in the associative array. Therefore, it suffices to sum all those counts to arrive at the marginal (i.e., \sum_{w'} N(w_i, w')), and then divide all the joint counts by the marginal to arrive at the relative frequency for all words. This implementation requires minimal modification to the original stripes algorithm in Figure 3.9, and illustrates the use of complex data structures to coordinate distributed computations in MapReduce. Through appropriate structuring of keys and values, one can use the MapReduce execution framework to bring together all the pieces of data required to perform a computation. Note that, as before, this algorithm also assumes that each associative array fits into memory.

How might one compute relative frequencies with the pairs approach? In the pairs approach, the reducer receives (w_i, w_j) as the key and the count as the value. From this alone it is not possible to compute f(w_j | w_i) since we do not have the marginal. Fortunately, as in the mapper, the reducer can preserve state across multiple keys. Inside the reducer, we can buffer in memory all the words that co-occur with w_i and their counts, in essence building the associative array in the stripes approach. To make this work, we must define the sort order of the pair so that keys are first sorted by the left word, and then by the right word. Given this ordering, we can easily detect if all pairs associated with the word we are conditioning on (w_i) have been encountered. At that point we can go back through the in-memory buffer, compute the relative frequencies, and then emit those results in the final key-value pairs.

There is one more modification necessary to make this algorithm work. We must ensure that all pairs with the same left word are sent to the same reducer. This, unfortunately, does not happen automatically: recall that the default partitioner is based on the hash value of the intermediate key, modulo the number of reducers. For a complex key, the raw byte representation is used to compute the hash value. As a result, there is no guarantee that, for example, (dog, aardvark) and (dog, zebra) are assigned to the same reducer. To produce the desired behavior, we must define a custom partitioner that only pays attention to the left word. That is, the partitioner should partition based on the hash of the left word only.

This algorithm will indeed work, but it suffers from the same drawback as the stripes approach: as the size of the corpus grows, so does the vocabulary size, and at some point there will not be sufficient memory to store all co-occurring words and their counts for the word we are conditioning

on. For computing the co-occurrence matrix, the advantage of the pairs approach is that it doesn't suffer from any memory bottlenecks. Is there a way to modify the basic pairs approach so that this advantage is retained?

As it turns out, such an algorithm is indeed possible, although it requires the coordination of several mechanisms in MapReduce. The insight lies in properly sequencing data presented to the reducer. If it were possible to somehow compute (or otherwise obtain access to) the marginal in the reducer before processing the joint counts, the reducer could simply divide the joint counts by the marginal to compute the relative frequencies. The notion of “before” and “after” can be captured in the ordering of key-value pairs, which can be explicitly controlled by the programmer. That is, the programmer can define the sort order of keys so that data needed earlier is presented to the reducer before data that is needed later. However, we still need to compute the marginal counts. Recall that in the basic pairs algorithm, each mapper emits a key-value pair with the co-occurring word pair as the key. To compute relative frequencies, we modify the mapper so that it additionally emits a “special” key of the form (w_i, ∗), with a value of one, that represents the contribution of the word pair to the marginal. Through use of combiners, these partial marginal counts will be aggregated before being sent to the reducers. Alternatively, the in-mapper combining pattern can be used to even more efficiently aggregate marginal counts.

In the reducer, we must make sure that the special key-value pairs representing the partial marginal contributions are processed before the normal key-value pairs representing the joint counts. This is accomplished by defining the sort order of the keys so that pairs with the special symbol of the form (w_i, ∗) are ordered before any other key-value pairs where the left word is w_i. In addition, as before we must also properly define the partitioner to pay attention to only the left word in each pair. With the data properly sequenced, the reducer can directly compute the relative frequencies.

    key                values
    (dog, ∗)           [6327, 8514, ...]    compute marginal: Σ_{w'} N(dog, w') = 42908
    (dog, aardvark)    [2, 1]               f(aardvark|dog) = 3/42908
    (dog, aardwolf)    [1]                  f(aardwolf|dog) = 1/42908
    ...
    (dog, zebra)       [2, 1, 1, 1]         f(zebra|dog) = 5/42908
    (doge, ∗)          [682, ...]           compute marginal: Σ_{w'} N(doge, w') = 1267
    ...

Figure 3.12: Example of the sequence of key-value pairs presented to the reducer in the pairs algorithm for computing relative frequencies. This illustrates the application of the order inversion design pattern.

A concrete example is shown in Figure 3.12, which lists the sequence of key-value pairs that a reducer might encounter. First, the reducer is presented with the special key (dog, ∗) and a number of values, each of which represents a partial marginal contribution from the map phase (assume here either combiners or in-mapper combining, so the values represent partially aggregated counts). The

reducer accumulates these counts to arrive at the marginal, Σ_{w'} N(dog, w'). The reducer holds on to this value as it processes subsequent keys. After (dog, ∗), the reducer will encounter a series of keys representing joint counts; let's say the first of these is the key (dog, aardvark). Associated with this key will be a list of values representing partial joint counts from the map phase (two separate values in this case). Summing these counts will yield the final joint count, i.e., the number of times dog and aardvark co-occur in the entire collection. At this point, since the reducer already knows the marginal, simple arithmetic suffices to compute the relative frequency. All subsequent joint counts are processed in exactly the same manner. When the reducer encounters the next special key-value pair (doge, ∗), the reducer resets its internal state and starts to accumulate the marginal all over again. Observe that the memory requirement for this algorithm is minimal, since only the marginal (an integer) needs to be stored. No buffering of individual co-occurring word counts is necessary, and therefore we have eliminated the scalability bottleneck of the previous algorithm.

This design pattern, which we call “order inversion”, occurs surprisingly often and across applications in many domains. It is so named because through proper coordination, we can access the result of a computation in the reducer (for example, an aggregate statistic) before processing the data needed for that computation. The key insight is to convert the sequencing of computations into a sorting problem. In most cases, an algorithm requires data in some fixed order: by controlling how keys are sorted and how the key space is partitioned, we can present data to the reducer in the order necessary to perform the proper computations. This greatly cuts down on the amount of partial results that the reducer needs to hold in memory.

To summarize, the specific application of the order inversion design pattern for computing relative frequencies requires the following:

• Emitting a special key-value pair for each co-occurring word pair in the mapper to capture its contribution to the marginal.
• Controlling the sort order of the intermediate key so that the key-value pairs representing the marginal contributions are processed by the reducer before any of the pairs representing the joint word co-occurrence counts.
• Defining a custom partitioner to ensure that all pairs with the same left word are shuffled to the same reducer.
• Preserving state across multiple keys in the reducer to first compute the marginal based on the special key-value pairs and then dividing the joint counts by the marginals to arrive at the relative frequencies.

As we will see in Chapter 4, this design pattern is also used in inverted index construction to properly set compression parameters for postings lists.
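The four ingredients of order inversion listed above can be traced in a small single-process Python simulation. It is a sketch under several assumptions: the empty string `STAR` is a hypothetical stand-in for the special symbol ∗ (chosen because it sorts before every non-empty word), Python's built-in `hash` stands in for a real partitioner, `sorted` stands in for the framework's sort, and the input is assumed to be an already extracted list of (left word, right word) co-occurrences.

```python
from collections import defaultdict

STAR = ""  # hypothetical marker for (w_i, *); sorts before any non-empty right word


def map_pairs(cooccurrences):
    """Emit the joint count and the special marginal contribution for each pair."""
    for left, right in cooccurrences:
        yield (left, right), 1
        yield (left, STAR), 1


def partition(key, num_reducers):
    """Custom partitioner: only the left word decides the reducer."""
    left, _ = key
    return hash(left) % num_reducers


def reduce_stream(sorted_items):
    """Process (key, counts) in sorted order, keeping only the marginal as state."""
    marginal = 0
    for (left, right), counts in sorted_items:
        if right == STAR:
            marginal = sum(counts)          # new left word: reset the marginal
        else:
            yield (left, right), sum(counts) / marginal


def run(cooccurrences, num_reducers=2):
    shards = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_pairs(cooccurrences):
        shards[partition(key, num_reducers)][key].append(value)
    result = {}
    for shard in shards:
        # Sorting by (left, right) places (w_i, STAR) ahead of all (w_i, w_j).
        result.update(reduce_stream(sorted(shard.items())))
    return result
```

Note that `reduce_stream` never buffers joint counts: the only state carried across keys is a single integer, which mirrors the memory argument made in the text.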

3.4 SECONDARY SORTING

MapReduce sorts intermediate key-value pairs by the keys during the shuffle and sort phase, which is very convenient if computations inside the reducer rely on sort order (e.g., the order inversion design pattern described in the previous section). However, what if in addition to sorting by key, we also need to sort by value? Google's MapReduce implementation provides built-in functionality for (optional) secondary sorting, which guarantees that values arrive in sorted order. Hadoop, unfortunately, does not have this capability built in.

Consider the example of sensor data from a scientific experiment: there are m sensors each taking readings on a continuous basis, where m is potentially a large number. A dump of the sensor data might look something like the following, where r_x after each timestamp represents the actual sensor reading (unimportant for this discussion, but may be a series of values, one or more complex records, or even raw bytes of images).

    (t1, m1, r80521)
    (t1, m2, r14209)
    (t1, m3, r76042)
    ...
    (t2, m1, r21823)
    (t2, m2, r66508)
    (t2, m3, r98347)

Suppose we wish to reconstruct the activity at each individual sensor over time. A MapReduce program to accomplish this might map over the raw data and emit the sensor id as the intermediate key, with the rest of each record as the value:

    m1 → (t1, r80521)

This would bring all readings from the same sensor together in the reducer. However, since MapReduce makes no guarantees about the ordering of values associated with the same key, the sensor readings will not likely be in temporal order. The most obvious solution is to buffer all the readings in memory and then sort by timestamp before additional processing. However, it should be apparent by now that any in-memory buffering of data introduces a potential scalability bottleneck. What if we are working with a high frequency sensor or sensor readings over a long period of time? What if the sensor readings themselves are large complex objects? This approach may not scale in these cases: the reducer would run out of memory trying to buffer all values associated with the same key.

This is a common problem, since in many applications we wish to first group together data one way (e.g., by sensor id), and then sort within the groupings another way (e.g., by time). Fortunately, there is a general purpose solution, which we call the “value-to-key conversion” design pattern. The basic idea is to move part of the value into the intermediate key to form a composite key, and let the MapReduce execution framework handle the sorting. In the above example, instead of emitting the sensor id as the key, we would emit the sensor id and the timestamp as a composite key:

    (m1, t1) → (r80521)

The sensor reading itself now occupies the value. We must define the intermediate key sort order to first sort by the sensor id (the left element in the pair) and then by the timestamp (the right element in the pair). We must also implement a custom partitioner so that all pairs associated with the same sensor are shuffled to the same reducer.

Properly orchestrated, the key-value pairs will be presented to the reducer in the correct sorted order:

    (m1, t1) → [(r80521)]
    (m1, t2) → [(r21823)]
    (m1, t3) → [(r146925)]
    ...

However, note that sensor readings are now split across multiple keys. The reducer will need to preserve state and keep track of when readings associated with the current sensor end and the next sensor begin.9

The basic tradeoff between the two approaches discussed above (buffer and in-memory sort vs. value-to-key conversion) is where sorting is performed. One can explicitly implement secondary sorting in the reducer, which is likely to be faster but suffers from a scalability bottleneck.10 With value-to-key conversion, sorting is offloaded to the MapReduce execution framework. Note that this approach can be arbitrarily extended to tertiary, quaternary, etc. sorting. This pattern results in many more keys for the framework to sort, but distributed sorting is a task that the MapReduce runtime excels at since it lies at the heart of the programming model.

9 Alternatively, Hadoop provides API hooks to define “groups” of intermediate keys that should be processed together in the reducer.
10 Note that, in principle, this need not be an in-memory sort. It is entirely possible to implement a disk-based sort within the reducer, although one would be duplicating functionality that is already present in the MapReduce execution framework. It makes more sense to take advantage of functionality that is already present with value-to-key conversion.

3.5 RELATIONAL JOINS

One popular application of Hadoop is data-warehousing. In an enterprise setting, a data warehouse serves as a vast repository of data, holding everything from sales transactions to product inventories. Typically, the data is relational in nature, but increasingly data warehouses are used to store semi-structured data (e.g., query logs) as well as unstructured data. Data warehouses form a foundation for business intelligence applications designed to provide decision support. It is widely believed that insights gained by mining historical, current, and prospective data can yield competitive advantages in the marketplace.

Traditionally, data warehouses have been implemented through relational databases, particularly those optimized for a specific workload known as online analytical processing (OLAP). A number of vendors offer parallel databases, but customers find that they often cannot cost-effectively scale to the crushing amounts of data an organization needs to deal with today. Parallel databases

are often quite expensive, on the order of tens of thousands of dollars per terabyte of user data. Over the past few years, Hadoop has gained popularity as a platform for data-warehousing. Hammerbacher [68], for example, discussed Facebook's experiences with scaling up business intelligence applications with Oracle databases, which they ultimately abandoned in favor of a Hadoop-based solution developed in-house called Hive (which is now an open-source project). Pig [114] is a platform for massive data analytics built on Hadoop and capable of handling structured as well as semi-structured data. It was originally developed by Yahoo, but is now also an open-source project.

Given successful applications of Hadoop to data-warehousing and complex analytical queries that are prevalent in such an environment, it makes sense to examine MapReduce algorithms for manipulating relational data. This section focuses specifically on performing relational joins in MapReduce. We should stress here that even though Hadoop has been applied to process relational data, Hadoop is not a database. There is an ongoing debate between advocates of parallel databases and proponents of MapReduce regarding the merits of both approaches for OLAP-type workloads. DeWitt and Stonebraker, two well-known figures in the database community, famously decried MapReduce as “a major step backwards” in a controversial blog post.11 With colleagues, they ran a series of benchmarks that demonstrated the supposed superiority of column-oriented parallel databases over Hadoop [120, 144]. However, see Dean and Ghemawat's counterarguments [47] and recent attempts at hybrid architectures [1].

11 http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/

We shall refrain here from participating in this lively debate, and instead focus on discussing algorithms. From an application point of view, it is highly unlikely that an analyst interacting with a data warehouse would ever be called upon to write MapReduce programs (and indeed, Hadoop-based systems such as Hive and Pig present a much higher-level language for interacting with large amounts of data). Nevertheless, it is instructive to understand the algorithms that underlie basic relational operations.

This section presents three different strategies for performing relational joins on two datasets (relations), generically named S and T. Let us suppose that relation S looks something like the following:

    (k1, s1, S1)
    (k2, s2, S2)
    (k3, s3, S3)
    ...

where k is the key we would like to join on, s_n is a unique id for the tuple, and the S_n after s_n denotes other attributes in the tuple (unimportant for the purposes of the join). Similarly, suppose relation T looks something like this:

    (k1, t1, T1)
    (k3, t2, T2)

(k8, t3, T3)
...

where k is the join key, tn is a unique id for the tuple, and the Tn after tn denotes other attributes in the tuple.

To make this task more concrete, we present one realistic scenario: S might represent a collection of user profiles, in which case k could be interpreted as the primary key (i.e., user id). The tuples might contain demographic information such as age, gender, income, etc. The other dataset, T, might represent logs of online activity. Each tuple might correspond to a page view of a particular URL and may contain additional information such as time spent on the page, ad revenue generated, etc. The k in these tuples could be interpreted as the foreign key that associates each individual page view with a user. Joining these two datasets would allow an analyst, for example, to break down online activity in terms of demographics.

3.5.1 REDUCE-SIDE JOIN

The first approach to relational joins is what’s known as a reduce-side join. The idea is quite simple: we map over both datasets and emit the join key as the intermediate key, and the tuple itself as the intermediate value. Since MapReduce guarantees that all values with the same key are brought together, all tuples will be grouped by the join key, which is exactly what we need to perform the join operation. This approach is known as a parallel sort-merge join in the database community [134]. In more detail, there are three different cases to consider.

The first and simplest is a one-to-one join, where at most one tuple from S and one tuple from T share the same join key (but it may be the case that no tuple from S shares the join key with a tuple from T, or vice versa). In this case, the algorithm sketched above will work fine. The reducer will be presented keys and lists of values along the lines of the following:

k23 → [(s64, S64), (t84, T84)]
k37 → [(s68, S68)]
k59 → [(t97, T97), (s81, S81)]
k61 → [(t99, T99)]
...

Since we’ve emitted the join key as the intermediate key, we can remove it from the value to save a bit of space.12 If there are two values associated with a key, then we know that one must be from S and the other must be from T. However, recall that in the basic MapReduce programming model, no guarantees are made about value ordering, so the first value might be from S or from T. We can proceed to join the two tuples and perform additional computations (e.g., filter by some other attribute, compute aggregates, etc.). If there is only one value associated with a key, this means that no tuple in the other dataset shares the join key, so the reducer does nothing.

12 Not very important if the intermediate data is compressed.
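The one-to-one case can be simulated in a few lines of plain Python. This is only a sketch under stated assumptions: the functions `map_relation`, `shuffle`, and `reduce_one_to_one` are hypothetical names, the dictionary-based `shuffle` merely stands in for the MapReduce execution framework, and each value carries an explicit "S" or "T" tag (an illustrative convenience, not something the text prescribes) so the reducer can tell the two relations apart regardless of value order.

```python
from collections import defaultdict

def map_relation(tag, tuples):
    """Emit (join_key, (tag, tuple_id, attrs)) for every tuple of a relation."""
    for key, tid, attrs in tuples:
        yield key, (tag, tid, attrs)

def shuffle(pairs):
    """Group values by key; a stand-in for the framework's shuffle and sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_one_to_one(key, values):
    """Join only when both relations contributed a tuple for this key."""
    if len(values) != 2:
        return []  # no partner in the other relation: emit nothing
    # Value order is arbitrary, so look the tuples up by their relation tag.
    by_tag = {tag: (tid, attrs) for tag, tid, attrs in values}
    return [(key, by_tag["S"], by_tag["T"])]

S = [("k1", "s1", "S1"), ("k2", "s2", "S2"), ("k3", "s3", "S3")]
T = [("k1", "t1", "T1"), ("k3", "t2", "T2"), ("k8", "t3", "T3")]

# T is deliberately mapped first to show that value order does not matter.
pairs = list(map_relation("T", T)) + list(map_relation("S", S))
joined = []
for key, values in sorted(shuffle(pairs).items()):
    joined.extend(reduce_one_to_one(key, values))
```

With the example relations S and T from this section, only k1 and k3 appear in both datasets, so `joined` holds exactly two joined tuples.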

Let us now consider the one-to-many join. Assume that tuples in S have unique join keys (i.e., k is the primary key in S), so that S is the “one” and T is the “many”. The above algorithm will still work, but when processing each key in the reducer, we have no idea when the value corresponding to the tuple from S will be encountered, since values are arbitrarily ordered. The easiest solution is to buffer all values in memory, pick out the tuple from S, and then cross it with every tuple from T to perform the join. However, as we have seen several times already, this creates a scalability bottleneck since we may not have sufficient memory to hold all the tuples with the same join key.

This is a problem that requires a secondary sort, and the solution lies in the value-to-key conversion design pattern we just presented. In the mapper, instead of simply emitting the join key as the intermediate key, we instead create a composite key consisting of the join key and the tuple id (from either S or T). Two additional changes are required: First, we must define the sort order of the keys to first sort by the join key, and then sort all tuple ids from S before all tuple ids from T. Second, we must define the partitioner to pay attention to only the join key, so that all composite keys with the same join key arrive at the same reducer.

After applying the value-to-key conversion design pattern, the reducer will be presented with keys and values along the lines of the following:

(k82, s105) → [(S105)]
(k82, t98) → [(T98)]
(k82, t101) → [(T101)]
(k82, t137) → [(T137)]
...

Since both the join key and the tuple id are present in the intermediate key, we can remove them from the value to save a bit of space.13 Whenever the reducer encounters a new join key, it is guaranteed that the associated value will be the relevant tuple from S. The reducer can hold this tuple in memory and then proceed to cross it with tuples from T in subsequent steps (until a new join key is encountered). Since the MapReduce execution framework performs the sorting, there is no need to buffer tuples (other than the single one from S). Thus, we have eliminated the scalability bottleneck.

13 Once again, not very important if the intermediate data is compressed.

Finally, let us consider the many-to-many join case. Assuming that S is the smaller dataset, the above algorithm works as well. Consider what happens at the reducer:

(k82, s105) → [(S105)]
(k82, s124) → [(S124)]
...
(k82, t98) → [(T98)]
(k82, t101) → [(T101)]
(k82, t137) → [(T137)]
...
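The composite-key arrangement can be sketched in plain Python as follows. The names `map_composite`, `partition`, and `reduce_stream` are hypothetical, and Python's built-in `sorted` stands in for the framework's sort. As an illustrative assumption, the composite key carries an explicit relation tag ("S" or "T") between the join key and the tuple id, so that ordinary tuple ordering places S before T; in Hadoop the same effect would come from a custom sort comparator.

```python
from itertools import groupby

def map_composite(tag, tuples):
    """Value-to-key conversion: emit ((join_key, tag, tuple_id), attrs)."""
    for key, tid, attrs in tuples:
        yield (key, tag, tid), attrs

def partition(composite_key, num_reducers):
    """Partition on the join key only, ignoring the tag and tuple id."""
    return hash(composite_key[0]) % num_reducers

def reduce_stream(sorted_pairs):
    """Consume pairs sorted by (join_key, tag, tuple_id), where "S" < "T".

    Only the S tuples of the current join key are held in memory; every
    T tuple is crossed with them as it streams past.
    """
    for join_key, group in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        s_buffer = []
        for (_, tag, tid), attrs in group:
            if tag == "S":
                s_buffer.append((tid, attrs))
            else:
                for s in s_buffer:
                    yield join_key, s, (tid, attrs)

S = [("k82", "s105", "S105"), ("k82", "s124", "S124")]
T = [("k82", "t98", "T98"), ("k82", "t101", "T101"),
     ("k82", "t137", "T137"), ("k90", "t200", "T200")]

pairs = sorted(list(map_composite("T", T)) + list(map_composite("S", S)))
joined = list(reduce_stream(pairs))
```

The same reducer covers the one-to-many case (the buffer then holds a single tuple) and the many-to-many case (the buffer holds all S tuples for the key). Tuple ids sort as strings here, so the order among T tuples differs from the listing, which does not affect the join result.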

All the tuples from S with the same join key will be encountered first, which the reducer can buffer in memory. As the reducer processes each tuple from T, it is crossed with all the tuples from S. Of course, we are assuming that the tuples from S (with the same join key) will fit into memory, which is a limitation of this algorithm (and why we want to control the sort order so that the smaller dataset comes first).

The basic idea behind the reduce-side join is to repartition the two datasets by the join key. The approach isn’t particularly efficient since it requires shuffling both datasets across the network. This leads us to the map-side join.

3.5.2 MAP-SIDE JOIN

Suppose we have two datasets that are both sorted by the join key. We can perform a join by scanning through both datasets simultaneously; this is known as a merge join in the database community. We can parallelize this by partitioning and sorting both datasets in the same way. For example, suppose S and T were both divided into ten files, partitioned in the same manner by the join key. Further suppose that in each file, the tuples were sorted by the join key. In this case, we simply need to merge join the first file of S with the first file of T, the second file of S with the second file of T, etc. This can be accomplished in parallel, in the map phase of a MapReduce job; hence, a map-side join. In practice, we map over one of the datasets (the larger one) and inside the mapper read the corresponding part of the other dataset to perform the merge join.14 No reducer is required, unless the programmer wishes to repartition the output or perform further processing.

14 Note that this almost always implies a non-local read.

A map-side join is far more efficient than a reduce-side join since there is no need to shuffle the datasets over the network. But is it realistic to expect that the stringent conditions required for map-side joins are satisfied? In many cases, yes. The reason is that relational joins happen within the broader context of a workflow, which may include multiple steps. Therefore, the datasets that are to be joined may be the output of previous processes (either MapReduce jobs or other code). If the workflow is known in advance and relatively static (both reasonable assumptions in a mature workflow), we can engineer the previous processes to generate output sorted and partitioned in a way that makes efficient map-side joins possible (in MapReduce, by using a custom partitioner and controlling the sort order of key-value pairs). For ad hoc data analysis, reduce-side joins are a more general, albeit less efficient, solution. Consider the case where datasets have multiple keys that one might wish to join on: then no matter how the data is organized, map-side joins will require repartitioning of the data. Alternatively, it is always possible to repartition a dataset using an identity mapper and reducer. But of course, this incurs the cost of shuffling data over the network.

There is a final restriction to bear in mind when using map-side joins with the Hadoop implementation of MapReduce. We assume here that the datasets to be joined were produced by previous MapReduce jobs, so this restriction applies to keys the reducers in those jobs may emit. Hadoop permits reducers to emit keys that are different from the input key whose values they are

processing (that is, input and output keys need not be the same, nor even the same type).15 However, if the output key of a reducer is different from the input key, then the output dataset from the reducer will not necessarily be partitioned in a manner consistent with the specified partitioner (because the partitioner applies to the input keys rather than the output keys). Since map-side joins depend on consistent partitioning and sorting of keys, the reducers used to generate data that will participate in a later map-side join must not emit any key but the one they are currently processing.

15 In contrast, recall from Section 2.2 that in Google’s implementation, reducers’ output keys must be exactly the same as their input keys.

3.5.3 MEMORY-BACKED JOIN

In addition to the two previous approaches to joining relational data that leverage the MapReduce framework to bring together tuples that share a common join key, there is a family of approaches we call memory-backed joins based on random access probes. The simplest version is applicable when one of the two datasets completely fits in memory on each node. In this situation, we can load the smaller dataset into memory in every mapper, populating an associative array to facilitate random access to tuples based on the join key. The mapper initialization API hook (see Section 3.1.1) can be used for this purpose. Mappers are then applied to the other (larger) dataset, and for each input key-value pair, the mapper probes the in-memory dataset to see if there is a tuple with the same join key. If there is, the join is performed. This is known as a simple hash join by the database community [51].

What if neither dataset fits in memory? The simplest solution is to divide the smaller dataset, let’s say S, into n partitions, such that S = S1 ∪ S2 ∪ ... ∪ Sn. We can choose n so that each partition is small enough to fit in memory, and then run n memory-backed hash joins. This, of course, requires streaming through the other dataset n times.

There is an alternative approach to memory-backed joins for cases where neither dataset fits into memory. A distributed key-value store can be used to hold one dataset in memory across multiple machines while mapping over the other. The mappers would then query this distributed key-value store in parallel and perform joins if the join keys match.16 The open-source caching system memcached can be used for exactly this purpose, and therefore we’ve dubbed this approach memcached join. For more information, this approach is detailed in a technical report [95].

16 In order to achieve good performance in accessing distributed key-value stores, it is often necessary to batch queries before making synchronous requests (to amortize latency over many requests) or to rely on asynchronous requests.

3.6 SUMMARY

This chapter provides a guide on the design of MapReduce algorithms. In particular, we present a number of “design patterns” that capture effective solutions to common problems. In summary, they are:

• “In-mapper combining”, where the functionality of the combiner is moved into the mapper. Instead of emitting intermediate output for every input key-value pair, the mapper aggregates

partial results across multiple input records and only emits intermediate key-value pairs after some amount of local aggregation is performed.

• The related patterns “pairs” and “stripes” for keeping track of joint events from a large number of observations. In the pairs approach, we keep track of each joint event separately, whereas in the stripes approach we keep track of all events that co-occur with the same event. Although the stripes approach is significantly more efficient, it requires memory on the order of the size of the event space, which presents a scalability bottleneck.

• “Order inversion”, where the main idea is to convert the sequencing of computations into a sorting problem. Through careful orchestration, we can send the reducer the result of a computation (e.g., an aggregate statistic) before it encounters the data necessary to produce that computation.

• “Value-to-key conversion”, which provides a scalable solution for secondary sorting. By moving part of the value into the key, we can exploit the MapReduce execution framework itself for sorting.

Ultimately, controlling synchronization in the MapReduce programming model boils down to effective use of the following techniques:

1. Constructing complex keys and values that bring together data necessary for a computation. This is used in all of the above design patterns.

2. Executing user-specified initialization and termination code in either the mapper or reducer. For example, in-mapper combining depends on emission of intermediate key-value pairs in the map task termination code.

3. Preserving state across multiple inputs in the mapper and reducer. This is used in in-mapper combining, order inversion, and value-to-key conversion.

4. Controlling the sort order of intermediate keys. This is used in order inversion and value-to-key conversion.

5. Controlling the partitioning of the intermediate key space. This is used in order inversion and value-to-key conversion.

This concludes our overview of MapReduce algorithm design. It should be clear by now that although the programming model forces one to express algorithms in terms of a small set of rigidly defined components, there are many tools at one’s disposal to shape the flow of computation. In the next few chapters, we will focus on specific classes of MapReduce algorithms: for inverted indexing in Chapter 4, for graph processing in Chapter 5, and for expectation-maximization in Chapter 6.
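Returning briefly to the joins of Section 3.5, the two strategies that avoid the shuffle can be contrasted in a short sketch. The functions `merge_join` and `hash_join` are hypothetical names, and the inputs are small in-memory lists standing in for one co-partitioned file pair (map-side join) or for a small relation loaded during mapper initialization (memory-backed join). The merge join shown assumes join keys in S are unique, which is a simplification for illustration rather than a requirement stated in the text.

```python
def merge_join(s_sorted, t_sorted):
    """Merge join of two lists sorted by join key (keys in S assumed unique)."""
    out, i = [], 0
    for key, tid, attrs in t_sorted:
        # Advance the S cursor until it reaches or passes the current T key.
        while i < len(s_sorted) and s_sorted[i][0] < key:
            i += 1
        if i < len(s_sorted) and s_sorted[i][0] == key:
            out.append((key, s_sorted[i][1:], (tid, attrs)))
    return out

def hash_join(s_small, t_stream):
    """Load S into an associative array, then probe it with each T tuple."""
    table = {}
    for key, sid, attrs in s_small:
        table.setdefault(key, []).append((sid, attrs))
    for key, tid, attrs in t_stream:
        for s in table.get(key, []):
            yield key, s, (tid, attrs)

S = [("k1", "s1", "S1"), ("k2", "s2", "S2"), ("k3", "s3", "S3")]
T = [("k1", "t1", "T1"), ("k3", "t2", "T2"),
     ("k3", "t4", "T4"), ("k8", "t3", "T3")]

merged = merge_join(S, T)
probed = list(hash_join(S, T))
```

Both functions produce the same joined tuples on this input. The merge join needs both inputs sorted but holds nothing in memory beyond its cursors, whereas the hash join accepts T in any order but must hold all of S.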

CHAPTER 4

Inverted Indexing for Text Retrieval

Web search is the quintessential large-data problem. Given an information need expressed as a short query consisting of a few terms, the system’s task is to retrieve relevant web objects (web pages, PDF documents, PowerPoint slides, etc.) and present them to the user. How large is the web? It is difficult to compute exactly, but even a conservative estimate would place the size at several tens of billions of pages, totaling hundreds of terabytes (considering text alone). In real-world applications, users demand results quickly from a search engine: query latencies longer than a few hundred milliseconds will try a user’s patience. Fulfilling these requirements is quite an engineering feat, considering the amounts of data involved!

Nearly all retrieval engines for full-text search today rely on a data structure called an inverted index, which given a term provides access to the list of documents that contain the term. In information retrieval parlance, objects to be retrieved are generically called “documents” even though in actuality they may be web pages, PDFs, or even fragments of code. Given a user query, the retrieval engine uses the inverted index to score documents that contain the query terms with respect to some ranking model, taking into account features such as term matches, term proximity, attributes of the terms in the document (e.g., bold, appears in title, etc.), as well as the hyperlink structure of the documents (e.g., PageRank [117], which we’ll discuss in Chapter 5, or related metrics such as HITS [84] and SALSA [88]).

The web search problem decomposes into three components: gathering web content (crawling), construction of the inverted index (indexing) and ranking documents given a query (retrieval). Crawling and indexing share similar characteristics and requirements, but these are very different from retrieval. Gathering web content and building inverted indexes are for the most part offline problems. Both need to be scalable and efficient, but they do not need to operate in real time. Indexing is usually a batch process that runs periodically: the frequency of refreshes and updates is usually dependent on the design of the crawler. Some sites (e.g., news organizations) update their content quite frequently and need to be visited often; other sites (e.g., government regulations) are relatively static. However, even for rapidly changing sites, it is usually tolerable to have a delay of a few minutes until content is searchable. Furthermore, since the amount of content that changes rapidly is relatively small, running smaller-scale index updates at greater frequencies is usually an adequate solution.1

1 Leaving aside the problem of searching live data streams such as tweets, which requires different techniques and algorithms.

Retrieval, on the other hand, is an online problem that demands sub-second

response time. Individual users expect low query latencies, but query throughput is equally important since a retrieval engine must usually serve many users concurrently. Furthermore, query loads are highly variable, depending on the time of day, and can exhibit “spikey” behavior due to special circumstances (e.g., a breaking news event triggers a large number of searches on the same topic). On the other hand, resource consumption for the indexing problem is more predictable.

A comprehensive treatment of web search is beyond the scope of this chapter, and even this entire book. Explicitly recognizing this, we mostly focus on the problem of inverted indexing, the task most amenable to solutions in MapReduce. This chapter begins by first providing an overview of web crawling (Section 4.1) and introducing the basic structure of an inverted index (Section 4.2). A baseline inverted indexing algorithm in MapReduce is presented in Section 4.3. We point out a scalability bottleneck in that algorithm, which leads to a revised version presented in Section 4.4. Index compression is discussed in Section 4.5, which fills in missing details on building compact index structures. Since MapReduce is primarily designed for batch-oriented processing, it does not provide an adequate solution for the retrieval problem, an issue we discuss in Section 4.6. The chapter concludes with a summary and pointers to additional readings.

4.1 WEB CRAWLING

Before building inverted indexes, we must first acquire the document collection over which these indexes are to be built. In academia and for research purposes, this can be relatively straightforward. Standard collections for information retrieval research are widely available for a variety of genres ranging from blogs to newswire text. For researchers who wish to explore web-scale retrieval, there is the ClueWeb09 collection that contains one billion web pages in ten languages (totaling 25 terabytes) crawled by Carnegie Mellon University in early 2009.2 Obtaining access to these standard collections is usually as simple as signing an appropriate data license from the distributor of the collection, paying a reasonable fee, and arranging for receipt of the data.3

2 http://boston.lti.cs.cmu.edu/Data/clueweb09/
3 As an interesting side note, in the 1990s, research collections were distributed via postal mail on CD-ROMs, and later, on DVDs. Electronic distribution became common earlier this decade for collections below a certain size. However, many collections today are so large that the only practical method of distribution is shipping hard drives via postal mail.

For real-world web search, however, one cannot simply assume that the collection is already available. Acquiring web content requires crawling, which is the process of traversing the web by repeatedly following hyperlinks and storing downloaded pages for subsequent processing. Conceptually, the process is quite simple to understand: we start by populating a queue with a “seed” list of pages. The crawler downloads pages in the queue, extracts links from those pages to add to the queue, stores the pages for further processing, and repeats. In fact, rudimentary web crawlers can be written in a few hundred lines of code.

However, effective and efficient web crawling is far more complex. The following lists a number of issues that real-world crawlers must contend with:

• A web crawler must practice good “etiquette” and not overload web servers. For example, it is common practice to wait a fixed amount of time before repeated requests to the same server. In order to respect these constraints while maintaining good throughput, a crawler typically keeps many execution threads running in parallel and maintains many TCP connections (perhaps hundreds) open at the same time.

• Since a crawler has finite bandwidth and resources, it must prioritize the order in which unvisited pages are downloaded. Such decisions must be made online and in an adversarial environment, in the sense that spammers actively create “link farms” and “spider traps” full of spam pages to trick a crawler into overrepresenting content from a particular site.

• Most real-world web crawlers are distributed systems that run on clusters of machines, often geographically distributed. To avoid downloading a page multiple times and to ensure data consistency, the crawler as a whole needs mechanisms for coordination and load-balancing. It also needs to be robust with respect to machine failures, network outages, and errors of various types.

• Web content changes, but with different frequency depending on both the site and the nature of the content. A web crawler needs to learn these update patterns to ensure that content is reasonably current. Getting the right recrawl frequency is tricky: too frequent means wasted resources, but not frequent enough leads to stale content.

• The web is full of duplicate content. Examples include multiple copies of a popular conference paper, mirrors of frequently accessed sites such as Wikipedia, and newswire content that is often duplicated. The problem is compounded by the fact that most repetitious pages are not exact duplicates but near duplicates (that is, basically the same page but with different ads, navigation bars, etc.). It is desirable during the crawling process to identify near duplicates and select the best exemplar to index.

• The web is multilingual. There is no guarantee that pages in one language only link to pages in the same language. For example, a professor in Asia may maintain her website in the local language, but link to her publications in English. Furthermore, many pages contain a mix of text in different languages. Since document processing techniques (e.g., tokenization, stemming) differ by language, it is important to identify the (dominant) language on a page.

The above discussion is not meant to be an exhaustive enumeration of issues, but rather to give the reader an appreciation of the complexities involved in this intuitively simple task. For more information, see a recent survey on web crawling [113]. Section 4.7 provides suggestions for additional readings.
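The conceptual crawl loop, together with the fixed per-server wait mentioned in the first bullet, can be sketched as follows. This is a deliberately rudimentary, single-threaded illustration: `crawl` is a hypothetical function, the `fetch` callback is assumed to return a page's out-links, the "web" is an in-memory dictionary rather than real HTTP, and waiting is modeled with a simulated clock instead of actual sleeping. None of the harder issues listed above (prioritization, distribution, recrawling, deduplication, language identification) are addressed.

```python
from collections import deque
from urllib.parse import urlparse

def crawl(seeds, fetch, min_delay=1.0, max_pages=100):
    """Breadth-first crawl with a per-host politeness delay.

    fetch(url) is a caller-supplied function returning the out-links of a
    page. Returns a list of (simulated_time, url) in download order.
    """
    queue = deque(seeds)
    seen = set(seeds)          # avoid queueing the same page twice
    next_allowed = {}          # host -> earliest time of the next request
    clock = 0.0
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        host = urlparse(url).netloc
        # Wait (on the simulated clock) if this host was contacted recently.
        clock = max(clock, next_allowed.get(host, 0.0))
        next_allowed[host] = clock + min_delay
        order.append((clock, url))
        for link in fetch(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# A tiny simulated web: page -> out-links.
WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": [],
}
visits = crawl(["http://a.example/"], lambda url: WEB.get(url, []))
```

In this run the second request to a.example is pushed back by the delay, while the request to b.example proceeds without any additional wait, which hints at why real crawlers interleave many hosts in parallel.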

4.2 INVERTED INDEXES

In its basic form, an inverted index consists of postings lists, one associated with each term that appears in the collection.4 The structure of an inverted index is illustrated in Figure 4.1. A postings list is comprised of individual postings, each of which consists of a document id and a payload: information about occurrences of the term in the document. The simplest payload is…nothing! For simple boolean retrieval, no additional information is needed in the posting other than the document id; the existence of the posting itself indicates the presence of the term in the document. The most common payload, however, is term frequency (tf), or the number of times the term occurs in the document. More complex payloads include positions of every occurrence of the term in the document (to support phrase queries and document scoring based on term proximity), properties of the term (such as if it occurred in the page title or not, to support document ranking based on notions of importance), or even the results of additional linguistic processing (for example, indicating that the term is part of a place name, to support address searches). In the web context, anchor text information (text associated with hyperlinks from other pages to the page in question) is useful in enriching the representation of document content (e.g., [107]); this information is often stored in the index as well.

terms      postings
term1  →  (d1, p)  (d5, p)  (d6, p)  (d11, p)  …
term2  →  (d11, p)  (d23, p)  (d59, p)  (d84, p)  …
term3  →  (d1, p)  (d4, p)  (d11, p)  (d19, p)  …
…

Figure 4.1: Simple illustration of an inverted index. Each term is associated with a list of postings. Each posting is comprised of a document id and a payload, denoted by p in this case. An inverted index provides quick access to document ids that contain a term.

4 In information retrieval parlance, term is preferred over word since documents are processed (e.g., tokenization and stemming) into basic units that are often not words in the linguistic sense.

In the example shown in Figure 4.1, we see that term1 occurs in {d1, d5, d6, d11, ...}, term2 occurs in {d11, d23, d59, d84, ...}, and term3 occurs in {d1, d4, d11, d19, ...}. In an actual implementation, we assume that documents can be identified by a unique integer ranging from 1 to n, where n is the total number of documents.5 Generally, postings are sorted by document id, although other sort orders are possible as well. The document ids have no inherent semantic meaning, although assignment of numeric ids to documents need not be arbitrary.

5 It is preferable to start numbering the documents at one since it is not possible to code zero with many common compression schemes used in information retrieval; see Section 4.5.

For example, pages from the same

domain may be consecutively numbered. Or, alternatively, pages that are higher in quality (based, for example, on PageRank values) might be assigned smaller numeric values so that they appear toward the front of a postings list. Either way, an auxiliary data structure is necessary to maintain the mapping from integer document ids to some other more meaningful handle, such as a URL.

Given a query, retrieval involves fetching postings lists associated with query terms and traversing the postings to compute the result set. In the simplest case, boolean retrieval involves set operations (union for boolean OR and intersection for boolean AND) on postings lists, which can be accomplished very efficiently since the postings are sorted by document id. In the general case, however, query–document scores must be computed. Partial document scores are stored in structures called accumulators. At the end (i.e., once all postings have been processed), the top k documents are then extracted to yield a ranked list of results for the user. Of course, there are many optimization strategies for query evaluation (both approximate and exact) that reduce the number of postings a retrieval engine must examine.

The size of an inverted index varies, depending on the payload stored in each posting. If only term frequency is stored, a well-optimized inverted index can be a tenth of the size of the original document collection. An inverted index that stores positional information would easily be several times larger than one that does not. Generally, it is possible to hold the entire vocabulary (i.e., dictionary of all the terms) in memory, especially with techniques such as front-coding [156]. However, with the exception of well-resourced, commercial web search engines,6 postings lists are usually too large to store in memory and must be held on disk, usually in compressed form (more details in Section 4.5). Query evaluation, therefore, necessarily involves random disk access and “decoding” of the postings. One important aspect of the retrieval problem is to organize disk operations such that random seeks are minimized.

6 Google keeps indexes in memory.

Once again, this brief discussion glosses over many complexities and does a huge injustice to the tremendous amount of research in information retrieval. However, our goal is to provide the reader with an overview of the important issues; Section 4.7 provides references to additional readings.

4.3 INVERTED INDEXING: BASELINE IMPLEMENTATION

MapReduce was designed from the very beginning to produce the various data structures involved in web search, including inverted indexes and the web graph. We begin with the basic inverted indexing algorithm shown in Figure 4.2.

Input to the mapper consists of document ids (keys) paired with the actual content (values). Individual documents are processed in parallel by the mappers. First, each document is analyzed and broken down into its component terms. The processing pipeline differs depending on the application and type of document, but for web pages typically involves stripping out HTML tags and other elements such as JavaScript code, tokenizing, case folding, removing stopwords (common words such as ‘the’, ‘a’, ‘of’, etc.), and stemming (removing affixes from words so that ‘dogs’ becomes ‘dog’).
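To make the data structure and the baseline algorithm of Figure 4.2 concrete, the following sketch builds a term-frequency index for the three toy documents of Figure 4.3 in plain Python and then answers a boolean AND query by intersecting docid-sorted postings lists. The function names are hypothetical, the dictionary-based grouping stands in for the shuffle and sort phase, and the `tokenize` function is a bare-minimum assumption (lowercasing and comma removal only, with no stopword removal or stemming).

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Minimal analysis: lowercase and drop commas (no stemming, no stopwords)."""
    return text.lower().replace(",", " ").split()

def map_doc(docid, text):
    """Mapper: emit (term, (docid, tf)) for each distinct term in the document."""
    for term, tf in Counter(tokenize(text)).items():
        yield term, (docid, tf)

def build_index(docs):
    """Group postings by term (shuffle), then sort each list by docid (reduce)."""
    grouped = defaultdict(list)
    for docid, text in docs.items():
        for term, posting in map_doc(docid, text):
            grouped[term].append(posting)
    return {term: sorted(postings) for term, postings in grouped.items()}

def boolean_and(index, term_a, term_b):
    """Intersect two docid-sorted postings lists with a linear merge."""
    a, b = index.get(term_a, []), index.get(term_b, [])
    i = j = 0
    hits = []
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            hits.append(a[i][0])
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return hits

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "one red bird"}
index = build_index(docs)
both = boolean_and(index, "fish", "red")
```

Because each postings list is sorted by document id, the intersection advances two cursors and never revisits a posting, which is the efficiency property the text attributes to sorted postings.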

 1: class Mapper
 2:   procedure Map(docid n, doc d)
 3:     H ← new AssociativeArray
 4:     for all term t ∈ doc d do
 5:       H{t} ← H{t} + 1
 6:     for all term t ∈ H do
 7:       Emit(term t, posting ⟨n, H{t}⟩)

 1: class Reducer
 2:   procedure Reduce(term t, postings [⟨n1, f1⟩, ⟨n2, f2⟩ ...])
 3:     P ← new List
 4:     for all posting ⟨a, f⟩ ∈ postings [⟨n1, f1⟩, ⟨n2, f2⟩ ...] do
 5:       P.Add(⟨a, f⟩)
 6:     P.Sort()
 7:     Emit(term t, postings P)

Figure 4.2: Pseudo-code of the baseline inverted indexing algorithm in MapReduce. Mappers emit postings keyed by terms, the execution framework groups postings by term, and the reducers write postings lists to disk.

Once the document has been analyzed, term frequencies are computed by iterating over all the terms and keeping track of counts. Lines 4 and 5 in the pseudo-code reflect the process of computing term frequencies, but hide the details of document processing. After this histogram has been built, the mapper then iterates over all terms. For each term, a pair consisting of the document id and the term frequency is created. Each pair, denoted by ⟨n, H{t}⟩ in the pseudo-code, represents an individual posting. The mapper then emits an intermediate key-value pair with the term as the key and the posting as the value, in line 7 of the mapper pseudo-code. Although as presented here only the term frequency is stored in the posting, this algorithm can be easily augmented to store additional information (e.g., term positions) in the payload.

In the shuffle and sort phase, the MapReduce runtime essentially performs a large, distributed group by of the postings by term. Without any additional effort by the programmer, the execution framework brings together all the postings that belong in the same postings list. This tremendously simplifies the task of the reducer, which simply needs to gather together all the postings and write them to disk. The reducer begins by initializing an empty list and then appends all postings associated with the same key (term) to the list. The postings are then sorted by document id, and the entire postings list is emitted as a value, with the term as the key. Typically, the postings list is first compressed, but we leave this aside for now (see Section 4.4 for more details). The final key-value pairs are written to disk and comprise the inverted index. Since each reducer writes its output in a separate file in the distributed file system, our final index will be split across r files, where r is the number of reducers.
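The baseline algorithm of Figure 4.2 can be sketched in a few lines of plain Python. This is only an in-process illustration, not Hadoop code: the function names, the toy tokenizer, and the dictionary that stands in for the shuffle and sort phase are all assumptions made for the example. The three documents are the toy collection of Figure 4.3.

```python
from collections import Counter, defaultdict

def map_doc(docid, text):
    """Mapper: emit (term, (docid, tf)) for every distinct term in one document."""
    # Toy tokenizer (assumption): lowercase, drop commas, split on whitespace.
    counts = Counter(text.lower().replace(",", " ").split())
    for term, tf in counts.items():
        yield term, (docid, tf)

def reduce_term(term, postings):
    """Reducer: buffer all postings of a term, sort them by docid, emit the list."""
    return term, sorted(postings)

def build_index(docs):
    grouped = defaultdict(list)          # stands in for the shuffle and sort phase
    for docid, text in docs.items():
        for term, posting in map_doc(docid, text):
            grouped[term].append(posting)
    return dict(reduce_term(t, p) for t, p in sorted(grouped.items()))

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "one red bird"}
index = build_index(docs)
print(index["fish"])   # [(1, 2), (2, 2)]
```

Note that reduce_term holds the complete postings list of a term in memory before sorting it, which is exactly the property of the baseline that the next section revisits.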

There is no need to further consolidate these files. Separately, we must also build an index to the postings lists themselves for the retrieval engine: this is typically in the form of mappings from term to (file, byte offset) pairs, so that given a term, the retrieval engine can fetch its postings list by opening the appropriate file and seeking to the correct byte offset position in that file.

Execution of the complete algorithm is illustrated in Figure 4.3 with a toy example consisting of three documents, three mappers, and two reducers. Intermediate key-value pairs (from the mappers) and the final key-value pairs comprising the inverted index (from the reducers) are shown in the figure. Postings are shown as pairs, with the document id on the left and the term frequency on the right.

doc 1: one fish, two fish
doc 2: red fish, blue fish
doc 3: one red bird

mapper 1 emits:    fish → (d1, 2)    one → (d1, 1)     two → (d1, 1)
mapper 2 emits:    blue → (d2, 1)    fish → (d2, 2)    red → (d2, 1)
mapper 3 emits:    bird → (d3, 1)    one → (d3, 1)     red → (d3, 1)

Shuffle and Sort: aggregate values by keys

reducer 1 writes:  fish → (d1, 2) (d2, 2)    one → (d1, 1) (d3, 1)    two → (d1, 1)
reducer 2 writes:  bird → (d3, 1)    blue → (d2, 1)    red → (d2, 1) (d3, 1)

Figure 4.3: Simple illustration of the baseline inverted indexing algorithm in MapReduce with three mappers and two reducers. Postings are shown as pairs (docid, tf).

The MapReduce programming model provides a very concise expression of the inverted indexing algorithm. Its implementation is similarly concise: the basic algorithm can be implemented in as few as a couple dozen lines of code in Hadoop (with minimal document processing). Such an implementation can be completed as a week-long programming assignment in a course for advanced undergraduates or first-year graduate students [83, 93].

In a non-MapReduce indexer, a significant fraction of the code is devoted to grouping postings by term, given constraints imposed by memory and disk (e.g., memory capacity is limited, disk seeks are slow, etc.). In MapReduce, the programmer does not need to worry about any of these issues: most of the heavy lifting is performed by the execution framework.

4.4 INVERTED INDEXING: REVISED IMPLEMENTATION

The inverted indexing algorithm presented in the previous section serves as a reasonable baseline. However, there is a significant scalability bottleneck: the algorithm assumes that there is sufficient memory to hold all postings associated with the same term. Since the basic MapReduce execution framework makes no guarantees about the ordering of values associated with the same key, the reducer first buffers all postings (line 5 of the reducer pseudo-code in Figure 4.2) and then performs an in-memory sort before writing the postings to disk.⁷ For efficient retrieval, postings need to be sorted by document id. However, as collections become larger, postings lists grow longer, and at some point in time, reducers will run out of memory.

There is a simple solution to this problem. Since the execution framework guarantees that keys arrive at each reducer in sorted order, one way to overcome the scalability bottleneck is to let the MapReduce runtime do the sorting for us. Instead of emitting key-value pairs of the following type:

(term t, posting ⟨docid, f⟩)

We emit intermediate key-value pairs of this type instead:

(tuple ⟨t, docid⟩, tf f)

In other words, the key is a tuple containing the term and the document id, while the value is the term frequency. This is exactly the value-to-key conversion design pattern introduced in Section 3.4. With this modification, the programming model ensures that the postings arrive in the correct order. This, combined with the fact that reducers can hold state across multiple keys, allows postings lists to be created with minimal memory usage. As a detail, remember that we must define a custom partitioner to ensure that all tuples with the same term are shuffled to the same reducer.

The revised MapReduce inverted indexing algorithm is shown in Figure 4.4. The mapper remains unchanged for the most part, other than differences in the intermediate key-value pairs. The Reduce method is called for each key (i.e., ⟨t, n⟩), and by design, there will only be one value associated with each key. For each key-value pair, a posting can be directly added to the postings list. Since the postings are guaranteed to arrive in sorted order by document id, they can be incrementally coded in compressed form, thus ensuring a small memory footprint. Finally, when all postings associated with the same term have been processed (i.e., t ≠ tprev), the entire postings list is emitted. The final postings list must be written out in the Close method.

⁷ See similar discussion in Section 3.4: in principle, this need not be an in-memory sort. It is entirely possible to implement a disk-based sort within the reducer.

 1: class Mapper
 2:   method Map(docid n, doc d)
 3:     H ← new AssociativeArray
 4:     for all term t ∈ doc d do
 5:       H{t} ← H{t} + 1
 6:     for all term t ∈ H do
 7:       Emit(tuple ⟨t, n⟩, tf H{t})

 1: class Reducer
 2:   method Initialize
 3:     tprev ← ∅
 4:     P ← new PostingsList
 5:   method Reduce(tuple ⟨t, n⟩, tf [f])
 6:     if t ≠ tprev ∧ tprev ≠ ∅ then
 7:       Emit(term tprev, postings P)
 8:       P.Reset()
 9:     P.Add(⟨n, f⟩)
10:     tprev ← t
11:   method Close
12:     Emit(term tprev, postings P)

Figure 4.4: Pseudo-code of a scalable inverted indexing algorithm in MapReduce. By applying the value-to-key conversion design pattern, the execution framework is exploited to sort postings so that they arrive sorted by document id in the reducer. (In lines 7 and 12 the emitted key is tprev, the term whose postings P currently holds.)

As with the baseline algorithm, payloads can be easily changed: by simply replacing the intermediate value f (term frequency) with whatever else is desired (e.g., term positional information).

There is one more detail we must address when building inverted indexes. Since almost all retrieval models take into account document length when computing query–document scores, this information must also be extracted. Although it is straightforward to express this computation as another MapReduce job, this task can actually be folded into the inverted indexing process. When processing the terms in each document, the document length is known, and can be written out as "side data" directly to HDFS. We can take advantage of the ability for a mapper to hold state across the processing of multiple documents in the following manner: an in-memory associative array is created to store document lengths, which is populated as each document is processed.⁸ When the mapper finishes processing input records, document lengths are written out to HDFS (i.e., in the Close method). This approach is essentially a variant of the in-mapper combining pattern. Document length data end up in m different files, where m is the number of mappers; these files are then consolidated into a more compact representation.

⁸ In general, there is no worry about insufficient memory to hold these data.
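The revised algorithm of Figure 4.4 can be illustrated with the following sketch, again as an in-process simulation rather than Hadoop code. The partition function, the Reducer class layout, and the sorted() call that stands in for the framework's sort are assumptions of the example; the point is that the reducer never sorts anything itself and keeps only the postings of the current term.

```python
from collections import Counter

def map_doc(docid, text):
    """Mapper: value-to-key conversion, the docid moves into the key."""
    for term, tf in Counter(text.lower().replace(",", " ").split()).items():
        yield (term, docid), tf

def partition(key, num_reducers):
    """Hypothetical custom partitioner: depends on the term only, never the docid."""
    term, _docid = key
    return sum(term.encode()) % num_reducers

class Reducer:
    def __init__(self):                      # Initialize
        self.prev, self.postings, self.output = None, [], []

    def reduce(self, key, tf):               # called once per (term, docid) key
        term, docid = key
        if self.prev is not None and term != self.prev:
            self.output.append((self.prev, self.postings))   # emit finished list
            self.postings = []
        self.postings.append((docid, tf))    # already in docid order
        self.prev = term

    def close(self):                         # flush the last postings list
        if self.prev is not None:
            self.output.append((self.prev, self.postings))

def build_index(docs, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for docid, text in docs.items():
        for key, tf in map_doc(docid, text):
            buckets[partition(key, num_reducers)].append((key, tf))
    index = {}
    for bucket in buckets:
        reducer = Reducer()
        for key, tf in sorted(bucket):       # the framework's sort by key
            reducer.reduce(key, tf)
        reducer.close()
        index.update(reducer.output)
    return index

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "one red bird"}
index = build_index(docs)
```

In a real job the append in reduce would be replaced by incremental compression of the posting, so memory use stays small regardless of the length of the postings list.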

3). (49. Thus. since the original document ids can be easily reconstructed from the d-gaps. (51. the entire inverted index would be about as large as the collection itself. (2. 4. 1). we can do signiﬁcantly better. (37. parentheses.] where each posting is represented by a pair in parentheses. which is to represent the d-gaps in a way such that it takes less space for smaller numbers. . . This naïve implementation would require six bytes per posting. we must actually encode the ﬁrst document id. 3).295. which is much less than even the most conservative estimate of the size of the web.967.74 4. 1). Let us consider the canonical case where each posting consists of a document id and the term frequency. we reduce the amount of space on disk necessary to store data. (7. document length information can be emitted in special key-value pairs by the mapper. in general. . Otherwise. we need to take a slight detour into compression techniques. . Note that all brackets. a postings list might be encoded as follows: [(5. Similarly. Compression. (12. since for the most part they are also small values. can be characterized as either lossless or lossy: it’s fairly obvious that loseless compression is required in this context. we want to apply the same techniques to compress the term frequencies.294. particularly for coding integers. 2). represented with d-gaps. . (2. and commas are only included to enhance readability. it’s not obvious that we’ve reduced the space requirements either. To start. But to understand how this is done. Using this scheme. Since the postings are sorted by document ids. . However. note that 232 − 1 is “only” 4. but at the cost of extra processor cycles that must be spent coding 9 However. (5.5 INDEX COMPRESSION We return to the question of how postings are actually compressed and stored on disk.] Of course. That is. it is important to understand that all compression techniques represent a time–space tradeoff. 
the postings would be represented as a long stream of integers. which will be responsible for writing out the length data separate from the postings lists. 1). 2). in reality. since the largest possible d-gap is one less than the number of documents in the collection. A naïve implementation might represent the ﬁrst as a 32-bit integer9 and the second as a 16-bit integer. This chapter devotes a substantial amount of space to this topic because index compression is one of the main differences between a “toy” indexer and one that works on real-world collections. 2). The ﬁrst trick is to encode differences between document ids as opposed to the document ids themselves. Fortunately. would be: [(5. the differences (called d-gaps) must be positive integers greater than zero. We haven’t lost any information. INVERTED INDEXING FOR TEXT RETRIEVAL then consolidated into a more compact representation. The above postings list. . 1). 2). Alternatively. MapReduce inverted indexing algorithms are pretty straightforward. One must then write a custom partitioner so that these special key-value pairs are shufﬂed to a single reducer. This is where the second trick comes in.
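The d-gap transformation described above is small enough to show in full. The following sketch uses hypothetical helper names (to_dgaps, from_dgaps) and assumes document ids are positive integers in strictly increasing order; it reproduces the example postings list from the text.

```python
from itertools import accumulate

def to_dgaps(docids):
    """Replace each docid by its difference from the previous one (first kept as-is)."""
    gaps = [d - prev for d, prev in zip(docids, [0] + docids[:-1])]
    if any(g <= 0 for g in gaps):
        raise ValueError("docids must be positive and strictly increasing")
    return gaps

def from_dgaps(gaps):
    """Recover the original docids with a running sum."""
    return list(accumulate(gaps))

docids = [5, 7, 12, 49, 51]
gaps = to_dgaps(docids)
print(gaps)               # [5, 2, 5, 37, 2]
print(from_dgaps(gaps))   # [5, 7, 12, 49, 51]
```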

Therefore, it is possible that compression reduces size but also slows processing. However, if the two factors are properly balanced (i.e., decoding speed can keep up with disk bandwidth), we can achieve the best of both worlds: smaller and faster.

4.5.1 BYTE-ALIGNED AND WORD-ALIGNED CODES

In most programming languages, an integer is encoded in four bytes and holds a value between 0 and $2^{32} - 1$, inclusive. We limit our discussion to unsigned integers, since d-gaps are always positive (and greater than zero). This means that 1 and 4,294,967,295 both occupy four bytes. Obviously, encoding d-gaps this way doesn't yield any reductions in size.

A simple approach to compression is to only use as many bytes as is necessary to represent the integer. This is known as variable-length integer coding (varInt for short) and accomplished by using the high order bit of every byte as the continuation bit, which is set to one in the last byte and zero elsewhere. As a result, we have 7 bits per byte for coding the value, which means that $0 \le n < 2^{7}$ can be expressed with 1 byte, $2^{7} \le n < 2^{14}$ with 2 bytes, $2^{14} \le n < 2^{21}$ with 3, and $2^{21} \le n < 2^{28}$ with 4 bytes. This scheme can be extended to code arbitrarily large integers (i.e., beyond 4 bytes). As a concrete example, the two numbers:

127, 128

would be coded as such:

1 1111111, 0 0000001 1 0000000

The above code contains two code words, the first consisting of 1 byte, and the second consisting of 2 bytes. Of course, the comma and the spaces are there only for readability. Variable-length integers are byte-aligned because the code words always fall along byte boundaries. As a result, there is never any ambiguity about where one code word ends and the next begins. However, the downside of varInt coding is that decoding involves lots of bit operations (masks, shifts). Furthermore, the continuation bit sometimes results in frequent branch mispredicts (depending on the actual distribution of d-gaps), which slows down processing.

A variant of the varInt scheme was described by Jeff Dean in a keynote talk at the WSDM 2009 conference.¹⁰ The insight is to code groups of four integers at a time. Each group begins with a prefix byte, divided into four 2-bit values that specify the byte length of each of the following integers. For example, the following prefix byte:

00,00,01,10

indicates that the following four integers are one byte, one byte, two bytes, and three bytes, respectively. Therefore, each group of four integers would consume anywhere between 5 and 17 bytes. A simple lookup table based on the prefix byte directs the decoder on how to process subsequent bytes to recover the coded integers. The advantage of this group varInt coding scheme is that values can be decoded with fewer branch mispredicts and bitwise operations.

¹⁰ http://research.google.com/people/jeff/WSDM09-keynote.pdf
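Both byte-aligned schemes can be sketched briefly. The varInt functions follow the convention stated in the text (high-order bit set to one in the last byte, most significant 7-bit group first), which reproduces the 127, 128 example. For group varInt the text fixes only the prefix byte layout; storing each integer little-endian, and computing lengths directly instead of using a lookup table, are assumptions of this sketch, as are all function names.

```python
def varint_encode(n):
    """Code n >= 0 in 7-bit groups; the high bit marks the LAST byte of a code word."""
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()              # most significant group first
    groups[-1] |= 0x80            # flag the final byte
    return bytes(groups)

def varint_decode_stream(data):
    """Decode a concatenation of varInt code words."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:           # last byte of this code word
            values.append(current)
            current = 0
    return values

def group_varint_encode(four):
    """Code four integers (each < 2**32) behind one prefix byte of 2-bit lengths."""
    assert len(four) == 4
    prefix, body = 0, b""
    for n in four:
        size = max(1, (n.bit_length() + 7) // 8)     # 1 to 4 bytes
        prefix = (prefix << 2) | (size - 1)
        body += n.to_bytes(size, "little")           # byte order is an assumption
    return bytes([prefix]) + body

def group_varint_decode(data):
    prefix, pos, out = data[0], 1, []
    for shift in (6, 4, 2, 0):
        size = ((prefix >> shift) & 0b11) + 1
        out.append(int.from_bytes(data[pos:pos + size], "little"))
        pos += size
    return out

stream = varint_encode(127) + varint_encode(128)
group = group_varint_encode([1, 2, 300, 70000])
```

Here the group [1, 2, 300, 70000] needs 1, 1, 2, and 3 bytes, so its prefix byte is exactly the 00,00,01,10 pattern from the example above.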

Experiments reported by Dean suggest that decoding integers with this scheme is more than twice as fast as the basic varInt scheme.

In most architectures, accessing entire machine words is more efficient than fetching all its bytes separately. Therefore, it makes sense to store postings in increments of 16-bit, 32-bit, or 64-bit machine words. Anh and Moffat [8] presented several word-aligned coding methods, one of which is called Simple-9, based on 32-bit words. In this coding scheme, four bits in each 32-bit word are reserved as a selector. The remaining 28 bits are used to code actual integer values. Now, there are a variety of ways these 28 bits can be divided to code one or more integers: 28 bits can be used to code one 28-bit integer, two 14-bit integers, three 9-bit integers (with one bit unused), etc., all the way up to twenty-eight 1-bit integers. In fact, there are nine different ways the 28 bits can be divided into equal parts (hence the name of the technique), some with leftover unused bits. This is stored in the selector bits. Therefore, decoding involves reading a 32-bit word, examining the selector to see how the remaining 28 bits are packed, and then appropriately decoding each integer. Coding works in the opposite way: the algorithm scans ahead to see how many integers can be squeezed into 28 bits, packs those integers, and sets the selector bits appropriately.

4.5.2 BIT-ALIGNED CODES

The advantage of byte-aligned and word-aligned codes is that they can be coded and decoded quickly. The downside, however, is that they must consume multiples of eight bits, even when fewer bits might suffice (the Simple-9 scheme gets around this by packing multiple integers into a 32-bit word, but even then, bits are often wasted). In bit-aligned codes, on the other hand, code words can occupy any number of bits, meaning that boundaries can fall anywhere. In practice, coding and decoding bit-aligned codes require processing bytes and appropriately shifting or masking bits (usually more involved than varInt and group varInt coding).

One additional challenge with bit-aligned codes is that we need a mechanism to delimit code words, i.e., tell where the last ends and the next begins, since there are no byte boundaries to guide us. To address this issue, most bit-aligned codes are so-called prefix codes (confusingly, they are also called prefix-free codes), in which no valid code word is a prefix of any other valid code word. For example, coding $0 \le x < 3$ with {0, 1, 01} is not a valid prefix code, since 0 is a prefix of 01, and so we can't tell if 01 is two code words or one. On the other hand, {00, 01, 1} is a valid prefix code, such that a sequence of bits:

0001101001010100

can be unambiguously segmented into:

00 01 1 01 00 1 01 01 00

and decoded without any additional delimiters.

One of the simplest prefix codes is the unary code. An integer x > 0 is coded as x − 1 one bits followed by a zero bit.
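The prefix property is what makes delimiter-free decoding possible, and it is easy to check mechanically. In the sketch below (function names are hypothetical, and bit streams are represented as Python strings of '0' and '1' purely for readability), is_prefix_free tests the property and segment splits a bit string greedily into code words.

```python
def is_prefix_free(codewords):
    """True if no code word is a prefix of a different code word."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

def segment(bits, codewords):
    """Split a bit string into code words; unambiguous when the code is prefix-free."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in codewords:
            out.append(current)
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code word")
    return out

good = {"00", "01", "1"}
bad = {"0", "1", "01"}
pieces = segment("0001101001010100", good)
print(" ".join(pieces))   # 00 01 1 01 00 1 01 01 00
```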

 x    unary         γ           Golomb b=5   Golomb b=10
 1    0             0           0:00         0:000
 2    10            10:0        0:01         0:001
 3    110           10:1        0:10         0:010
 4    1110          110:00      0:110        0:011
 5    11110         110:01      0:111        0:100
 6    111110        110:10      10:00        0:101
 7    1111110       110:11      10:01        0:1100
 8    11111110      1110:000    10:10        0:1101
 9    111111110     1110:001    10:110       0:1110
10    1111111110    1110:010    10:111       0:1111

Figure 4.5: The first ten positive integers in unary, γ, and Golomb (b = 5, 10) codes.

Note that unary codes do not allow the representation of zero, which is fine since d-gaps and term frequencies should never be zero.¹¹ As an example, 4 in unary code is 1110. With unary code we can code x in x bits, which although economical for small values, becomes inefficient for even moderately large values. Unary codes are rarely used by themselves, but form a component of other coding schemes. Unary codes of the first ten positive integers are shown in Figure 4.5.

Elias γ code is an efficient coding scheme that is widely used in practice. An integer x > 0 is broken into two components, $1 + \lfloor \log_2 x \rfloor$ (= n, the length), which is coded in unary code, and $x - 2^{\lfloor \log_2 x \rfloor}$ (= r, the remainder), which is in binary.¹² The unary component n specifies the number of bits required to code x, and the binary component codes the remainder r in n − 1 bits. As an example, consider x = 10: $1 + \lfloor \log_2 10 \rfloor = 4$, which is 1110. The binary component codes $x - 2^{3} = 2$ in 4 − 1 = 3 bits, which is 010. Putting both together, we arrive at 1110:010. The extra colon is inserted only for readability; it's not part of the final code, of course.

Working in reverse, it is easy to unambiguously decode a bit stream of γ codes: First, we read a unary code $c_u$, which is a prefix code. This tells us that the binary portion is written in $c_u - 1$ bits, which we then read as $c_b$. We can then reconstruct x as $2^{c_u - 1} + c_b$. For x < 16, γ codes occupy less than a full byte, which makes them more compact than variable-length integer codes. Since term frequencies for the most part are relatively small, γ codes make sense for them and can yield substantial space savings. For reference, the γ codes of the first ten positive integers are shown in Figure 4.5. A variation on γ code is δ code, where the n portion of the γ code is coded in γ code itself (as opposed to unary code). For smaller values γ codes are more compact, but for larger values, δ codes take less space.

¹¹ As a note, some sources describe slightly different formulations of the same coding scheme. Here, we adopt the conventions used in the classic IR text Managing Gigabytes [156].

¹² Note that $\lfloor x \rfloor$ is the floor function, which maps x to the largest integer not greater than x, so, e.g., $\lfloor 3.8 \rfloor = 3$. This is the default behavior in many programming languages when casting from a floating-point type to an integer type.
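The unary and γ definitions above translate almost directly into code. This is a sketch with hypothetical function names; bit streams are represented as strings of '0' and '1' for readability, whereas a real indexer would pack bits into bytes. The length n is obtained from int.bit_length, which equals $1 + \lfloor \log_2 x \rfloor$ for x > 0.

```python
def unary(x):
    """x > 0 coded as x - 1 one bits followed by a zero bit."""
    return "1" * (x - 1) + "0"

def gamma_encode(x):
    """Elias gamma: length n in unary, then remainder r in n - 1 binary bits."""
    n = x.bit_length()                       # 1 + floor(log2 x)
    r = x - (1 << (n - 1))                   # x - 2^floor(log2 x)
    binary = format(r, "b").zfill(n - 1) if n > 1 else ""
    return unary(n) + binary

def gamma_decode(bits):
    """Decode a concatenation of well-formed gamma codes."""
    values, i = [], 0
    while i < len(bits):
        cu = 1
        while bits[i] == "1":                # read the unary part
            cu += 1
            i += 1
        i += 1                               # skip the terminating zero
        cb = int(bits[i:i + cu - 1], 2) if cu > 1 else 0
        i += cu - 1
        values.append((1 << (cu - 1)) + cb)  # x = 2^(cu - 1) + cb
    return values

print(gamma_encode(10))   # 1110010
stream = "".join(gamma_encode(x) for x in range(1, 11))
```

Because each code word announces its own length, the concatenated stream decodes back to the integers 1 through 10 without any separators.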

Unary and γ codes are parameterless, but even better compression can be achieved with parameterized codes. A good example of this is Golomb code. For some parameter b, an integer x > 0 is coded in two parts: first, we compute $q = \lfloor (x - 1)/b \rfloor$ and code q + 1 in unary; then, we code the remainder $r = x - qb - 1$ in truncated binary. This is accomplished as follows: if b is a power of two, then truncated binary is exactly the same as normal binary, requiring $\log_2 b$ bits. Otherwise, we code the first $2^{\lfloor \log_2 b \rfloor + 1} - b$ values of r in $\lfloor \log_2 b \rfloor$ bits and code the rest of the values of r by coding $r + 2^{\lfloor \log_2 b \rfloor + 1} - b$ in ordinary binary representation using $\lfloor \log_2 b \rfloor + 1$ bits. In this case, the r is coded in either $\lfloor \log_2 b \rfloor$ or $\lfloor \log_2 b \rfloor + 1$ bits, and unlike ordinary binary coding, truncated binary codes are prefix codes. As an example, if b = 5, then r can take the values {0, 1, 2, 3, 4}, which would be coded with the following code words: {00, 01, 10, 110, 111}. For reference, Golomb codes of the first ten positive integers are shown in Figure 4.5 for b = 5 and b = 10. A special case of Golomb code is worth noting: if b is a power of two, then coding and decoding can be handled more efficiently (needing only bit shifts and bit masks, as opposed to multiplication and division). These are known as Rice codes.

Researchers have shown that Golomb compression works well for d-gaps, and is optimal with the following parameter setting:

$b \approx 0.69 \times \frac{N}{df}$    (4.1)

where df is the document frequency of the term, and N is the number of documents in the collection.¹³

Putting everything together, one popular approach for postings compression is to represent d-gaps with Golomb codes and term frequencies with γ codes [156, 162]. If positional information is desired, we can use the same trick to code differences between term positions using γ codes.

4.5.3 POSTINGS COMPRESSION

Having completed our slight detour into integer compression techniques, we can now return to the scalable inverted indexing algorithm shown in Figure 4.4 and discuss how postings lists can be properly compressed. As we can see from the previous section, there is a wide range of choices that represent different trade offs between compression ratio and decoding speed. Actual performance also depends on characteristics of the collection, which, among other factors, determine the distribution of d-gaps. Büttcher et al. [30] recently compared the performance of various compression techniques on coding document ids. In terms of the amount of compression that can be obtained (measured in bits per docid), Golomb and Rice codes performed the best, followed by γ codes, Simple-9, varInt, and group varInt (the least space efficient).

¹³ For details as to why this is the case, we refer the reader elsewhere [156], but here's the intuition: under reasonable assumptions, the appearance of postings can be modeled as a sequence of independent Bernoulli trials, which implies a certain distribution of d-gaps. From this we can derive an optimal setting of b.
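Golomb coding as defined above can be sketched as follows, once more with bit strings for readability. The function names are hypothetical, and rounding the parameter of Equation 4.1 to the nearest integer (with a floor of 1) is an assumption of this sketch, since the text only gives the approximate relation. The encoder reproduces the Golomb columns of Figure 4.5.

```python
def truncated_binary(r, b):
    """Code remainder r (0 <= r < b) in floor(log2 b) or floor(log2 b) + 1 bits."""
    k = b.bit_length() - 1                   # floor(log2 b)
    if b == 1 << k:                          # power of two: plain k-bit binary
        return format(r, "b").zfill(k) if k else ""
    short = (1 << (k + 1)) - b               # how many values get the short code
    if r < short:
        return format(r, "b").zfill(k)
    return format(r + short, "b").zfill(k + 1)

def golomb_encode(x, b):
    """q + 1 in unary, followed by the remainder in truncated binary."""
    q, r = divmod(x - 1, b)                  # q = floor((x-1)/b), r = x - q*b - 1
    return "1" * q + "0" + truncated_binary(r, b)

def golomb_parameter(num_docs, df):
    """b of Equation 4.1; rounding to an integer is an assumption of this sketch."""
    return max(1, round(0.69 * num_docs / df))

codes_b5 = [golomb_encode(x, 5) for x in range(1, 11)]
print(codes_b5[3])                   # 0110, i.e. 0:110 in Figure 4.5
print(golomb_parameter(1000, 100))   # 7
```

When b is a power of two the divmod and the zero-padded binary reduce to a shift and a mask, which is the Rice code special case mentioned above.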

In terms of raw decoding speed, the order was almost the reverse: group varInt was the fastest, followed by varInt.¹⁴ Simple-9 was substantially slower, and the bit-aligned codes were even slower than that. Within the bit-aligned codes, Rice codes were the fastest, followed by γ, with Golomb codes being the slowest (about ten times slower than group varInt).

Let us discuss what modifications are necessary to our inverted indexing algorithm if we were to adopt Golomb compression for d-gaps and represent term frequencies with γ codes. Note that this represents a space-efficient encoding, at the cost of slower decoding compared to alternatives. Whether or not this is actually a worthwhile tradeoff in practice is not important here: use of Golomb codes serves a pedagogical purpose, to illustrate how one might set compression parameters.

Coding term frequencies with γ codes is easy since they are parameterless. Compressing d-gaps with Golomb codes, however, is a bit tricky, since two parameters are required: the size of the document collection and the number of postings for a particular postings list (i.e., the document frequency, or df). The first is easy to obtain and can be passed into the reducer as a constant. The df of a term, however, is not known until all the postings have been processed, and unfortunately, the parameter must be known before any posting is coded. At first glance, this seems like a chicken-and-egg problem. A two-pass solution that involves first buffering the postings (in memory) would suffer from the memory bottleneck we've been trying to avoid in the first place.

To get around this problem, we need to somehow inform the reducer of a term's df before any of its postings arrive. This can be solved with the order inversion design pattern introduced in Section 3.3 to compute relative frequencies. The solution is to have the mapper emit special keys of the form ⟨t, ∗⟩ to communicate partial document frequencies. That is, inside the mapper, in addition to emitting intermediate key-value pairs of the following form:

(tuple ⟨t, docid⟩, tf f)

we also emit special intermediate key-value pairs like this:

(tuple ⟨t, ∗⟩, df e)

to keep track of document frequencies associated with each term. In practice, we can accomplish this by applying the in-mapper combining design pattern (see Section 3.1). The mapper holds an in-memory associative array that keeps track of how many documents a term has been observed in (i.e., the local document frequency of the term for the subset of documents processed by the mapper). Once the mapper has processed all input records, special keys of the form ⟨t, ∗⟩ are emitted with the partial df as the value.

To ensure that these special keys arrive first, we define the sort order of the tuple so that the special symbol ∗ precedes all documents (part of the order inversion design pattern). Thus, for each term, the reducer will first encounter the ⟨t, ∗⟩ key, associated with a list of values representing partial df values originating from each mapper. Summing all these partial contributions will yield the term's df, which can then be used to set the Golomb compression parameter b.

¹⁴ However, this study found less speed difference between group varInt and basic varInt than Dean's analysis, presumably due to the different distribution of d-gaps in the collections they were examining.
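The interplay of in-mapper combining and order inversion described above can be simulated in a few lines. Everything here is an illustrative assumption rather than the Hadoop API: the special symbol ∗ is represented by the sentinel -1 (which sorts before any positive document id), a plain sorted() stands in for the framework's sort, a single reducer is used, and the rounding of b is the same assumption as before.

```python
from collections import Counter

STAR = -1   # stands in for the special symbol; sorts before all positive docids

class Mapper:
    def __init__(self):
        self.local_df = Counter()            # in-mapper combining state

    def map(self, docid, text):
        for term, tf in Counter(text.lower().replace(",", " ").split()).items():
            self.local_df[term] += 1         # one more document contains term
            yield (term, docid), tf

    def close(self):
        for term, df in self.local_df.items():
            yield (term, STAR), df           # partial df from this mapper

def reduce_sorted(pairs, num_docs):
    """pairs: ((term, docid), value) in key order, as one reducer would see them."""
    out = {}
    for (term, docid), value in pairs:
        if docid == STAR:                    # partial dfs arrive before any posting
            entry = out.setdefault(term, {"df": 0, "b": None, "postings": []})
            entry["df"] += value
        else:
            entry = out[term]
            if entry["b"] is None:           # df is complete: fix the parameter now
                entry["b"] = max(1, round(0.69 * num_docs / entry["df"]))
            entry["postings"].append((docid, value))   # would be coded with b here
    return out

def run(docs_by_mapper):
    pairs, num_docs = [], 0
    for docs in docs_by_mapper:
        mapper = Mapper()
        for docid, text in docs.items():
            num_docs += 1
            pairs.extend(mapper.map(docid, text))
        pairs.extend(mapper.close())
    return reduce_sorted(sorted(pairs), num_docs)

result = run([{1: "one fish, two fish", 2: "red fish, blue fish"}, {3: "one red bird"}])
```

For the term red, each of the two simulated mappers contributes a partial df of 1, and both contributions are summed before the first posting of red is seen.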

This allows the postings to be incrementally compressed as they are encountered in the reducer: memory bottlenecks are eliminated since we do not need to buffer postings in memory.

Once again, the order inversion design pattern comes to the rescue. Recall that the pattern is useful when a reducer needs to access the result of a computation (e.g., an aggregate statistic) before it encounters the data necessary to produce that computation. For computing relative frequencies, that bit of information was the marginal. In this case, it's the document frequency.

4.6 WHAT ABOUT RETRIEVAL?

Thus far, we have briefly discussed web crawling and focused mostly on MapReduce algorithms for inverted indexing. What about retrieval? It should be fairly obvious that MapReduce, which was designed for large batch operations, is a poor solution for retrieval. Since users demand sub-second response times, every aspect of retrieval must be optimized for low latency, which is exactly the opposite tradeoff made in MapReduce. Recall the basic retrieval problem: we must look up postings lists corresponding to query terms, systematically traverse those postings lists to compute query–document scores, and then return the top k results to the user. Looking up postings implies random disk seeks, since for the most part postings are too large to fit into memory (leaving aside caching and other special cases for now). Unfortunately, random access is not a forte of the distributed file system underlying MapReduce: such operations require multiple round-trip network exchanges (and associated latencies). In HDFS, a client must first obtain the location of the desired data block from the namenode before the appropriate datanode can be contacted for the actual data. Of course, access will typically require a random disk seek on the datanode itself.

It should be fairly obvious that serving the search needs of a large number of users, each of whom demand sub-second response times, is beyond the capabilities of any single machine. The only solution is to distribute retrieval across a large number of machines, which necessitates breaking up the index in some manner. There are two main partitioning strategies for distributed retrieval: document partitioning and term partitioning. Under document partitioning, the entire collection is broken up into multiple smaller sub-collections, each of which is assigned to a server. In other words, each server holds the complete index for a subset of the entire collection. This corresponds to partitioning vertically in Figure 4.6. With term partitioning, on the other hand, each server is responsible for a subset of the terms for the entire collection. That is, a server holds the postings for all documents in the collection for a subset of terms. This corresponds to partitioning horizontally in Figure 4.6.

Document and term partitioning require different retrieval strategies and represent different trade offs. Retrieval under document partitioning involves a query broker, which forwards the user's query to all partition servers, merges partial results from each, and then returns the final results to the user. With this architecture, searching the entire collection requires that the query be processed by every partition server. However, since each partition operates independently and traverses postings in parallel, document partitioning typically yields shorter query latencies (compared to a single monolithic index with much longer postings lists).
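The broker pattern for document partitioning amounts to a scatter-gather followed by a top-k merge. The sketch below is a single-process illustration with hypothetical names; the scoring function (summed term frequency) is a deliberately trivial stand-in for a real retrieval model, and the partition servers are plain dictionaries rather than networked services.

```python
import heapq

def search_partition(index, query_terms, k):
    """Partition server: score only the documents of its own sub-collection."""
    scores = {}
    for term in query_terms:
        for docid, tf in index.get(term, []):
            scores[docid] = scores.get(docid, 0) + tf     # toy score: summed tf
    return heapq.nlargest(k, ((score, docid) for docid, score in scores.items()))

def broker(partitions, query_terms, k):
    """Forward the query to every partition, then merge the partial top-k lists."""
    partial = [hit for index in partitions
               for hit in search_partition(index, query_terms, k)]
    return [(docid, score) for score, docid in heapq.nlargest(k, partial)]

# Two partitions of the toy collection: documents 1 and 2, and document 3.
partitions = [
    {"fish": [(1, 2), (2, 2)], "one": [(1, 1)], "two": [(1, 1)],
     "red": [(2, 1)], "blue": [(2, 1)]},
    {"one": [(3, 1)], "red": [(3, 1)], "bird": [(3, 1)]},
]
top = broker(partitions, ["red", "fish"], k=2)
print(top)   # [(2, 3), (1, 2)]
```

Each partition only needs to return its own k best hits, because the global top k is necessarily contained in the union of the per-partition top k lists.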

[Figure 4.6 shows a term–document matrix with terms t1 to t9 as rows and documents d1 to d9 as columns; vertical cuts delimit partition 1, partition 2, and partition 3, and horizontal cuts delimit partition a, partition b, and partition c. The individual cell values are not recoverable here.]

Figure 4.6: Term–document matrix for a toy collection (nine documents, nine terms) illustrating different partitioning strategies: partitioning vertically (1, 2, 3) corresponds to document partitioning, whereas partitioning horizontally (a, b, c) corresponds to term partitioning.

Retrieval under term partitioning, on the other hand, requires a very different strategy. Suppose the user's query Q contains three terms, q1, q2, and q3. Under the pipelined query evaluation strategy, the broker begins by forwarding the query to the server that holds the postings for q1 (usually the least frequent term). The server traverses the appropriate postings list and computes partial query–document scores, stored in the accumulators. The accumulators are then passed to the server that holds the postings associated with q2 for additional processing, and then to the server for q3, before final results are passed back to the broker and returned to the user. Although this query evaluation strategy may not substantially reduce the latency of any particular query, it can theoretically increase a system's throughput due to the far smaller number of total disk seeks required for each user query (compared to document partitioning). However, load-balancing is tricky in a pipelined term-partitioned architecture due to skew in the distribution of query terms, which can create "hot spots" on servers that hold the postings for frequently occurring query terms.
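Pipelined query evaluation over a term-partitioned index can be sketched in the same toy setting. The names, the list-of-dictionaries representation of servers, and the summed-tf score are assumptions of this example; the sketch only shows how the accumulators travel from the server of one query term to the next, starting with the least frequent term.

```python
def pipelined_search(servers, query_terms, k):
    """servers: list of dicts, each holding full postings lists for its own terms."""
    owner = {term: i for i, server in enumerate(servers) for term in server}
    known = [t for t in query_terms if t in owner]
    # Visit terms from least to most frequent (df = length of the postings list).
    order = sorted(known, key=lambda t: len(servers[owner[t]][t]))
    accumulators, route = {}, []
    for term in order:
        server_id = owner[term]
        route.append(server_id)              # accumulators are shipped to this server
        for docid, tf in servers[server_id][term]:
            accumulators[docid] = accumulators.get(docid, 0) + tf   # toy score
    ranked = sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k], route

# Term partitioning of the toy index: each term lives on exactly one server.
servers = [
    {"fish": [(1, 2), (2, 2)], "one": [(1, 1), (3, 1)], "two": [(1, 1)]},
    {"red": [(2, 1), (3, 1)], "bird": [(3, 1)], "blue": [(2, 1)]},
]
top, route = pipelined_search(servers, ["fish", "bird"], k=2)
print(top, route)   # [(1, 2), (2, 2)] [1, 0]
```

Only the servers that own a query term are contacted, in contrast to the document-partitioned broker, which must contact every partition.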

In general, studies have shown that document partitioning is a better strategy overall [109], and this is the strategy adopted by Google [16]. Furthermore, it is known that Google maintains its indexes in memory (although this is certainly not the common case for search engines in general). One key advantage of document partitioning is that result quality degrades gracefully with machine failures. Partition servers that are offline will simply fail to deliver results for their subsets of the collection. With sufficient partitions, users might not even be aware that documents are missing. For most queries, the web contains more relevant documents than any user has time to digest: users of course care about getting relevant documents (sometimes, they are happy with a single relevant document), but they are generally less discriminating when it comes to which relevant documents appear in their results (out of the set of all relevant documents). Note that partitions may be unavailable due to reasons other than machine failure: cycling through different partitions is a very simple and non-disruptive strategy for index updates.

Working in a document-partitioned architecture, there are a variety of approaches to dividing up the web into smaller pieces. Proper partitioning of the collection can address one major weakness of this architecture, which is that every partition server is involved in every user query. Along one dimension, it is desirable to partition by document quality using one or more classifiers; see [124] for a recent survey on web page classification. Partitioning by document quality supports a multi-phase search strategy: the system examines partitions containing high quality documents first, and only backs off to partitions containing lower quality documents if necessary. This reduces the number of servers that need to be contacted for a user query. Along an orthogonal dimension, it is desirable to partition documents by content (perhaps also guided by the distribution of user queries from logs), so that each partition is "well separated" from the others in terms of topical coverage. This also reduces the number of machines that need to be involved in serving a user's query: the broker can direct queries only to the partitions that are likely to contain relevant documents, as opposed to forwarding the user query to all the partitions.

On a large scale, reliability of service is provided by replication, both in terms of multiple machines serving the same partition within a single datacenter, but also replication across geographically distributed datacenters. This creates at least two query routing problems: since it makes sense to serve clients from the closest datacenter, a service must route queries to the appropriate location. Within a single datacenter, the system needs to properly balance load across replicas.

There are two final components of real-world search engines that are worth discussing. First, recall that postings only store document ids. Therefore, raw retrieval results consist of a ranked list of semantically meaningless document ids. It is typically the responsibility of document servers, functionally distinct from the partition servers holding the indexes, to generate meaningful output for user presentation. Abstractly, a document server takes as input a query and a document id, and computes an appropriate result entry, typically comprising the title and URL of the page, a snippet of the source document showing the user's query terms in context, and additional metadata about the document. Second, query evaluation can benefit immensely from caching, of individual postings (assuming that the index is not already in memory) and even results of entire queries [13]. This is made possible by the Zipfian distribution of queries, with very frequent queries at the head of the distribution dominating the total number of queries.

but quickly noticed a scalability bottleneck that stemmed from having to buffer postings in memory.4. While outdated in many other respects. 4.7 SUMMARY AND ADDITIONAL READINGS Web search is a complex problem that breaks down into three conceptually distinct components. [30] is noteworthy in having detailed experimental evaluations that compare the performance (both effectiveness and efﬁciency) of a wide range of algorithms and techniques. A survey article by Zobel and Moffat [162] is an excellent starting point on indexing and retrieval algorithms. however. Here. Our brief discussion of web search glosses over many complexities and does a huge injustice to the tremendous amount of research in information retrieval. SUMMARY AND ADDITIONAL READINGS 83 made possible by the Zipﬁan distribution of queries. we provide a few entry points into the literature. inverted indexes and other auxiliary data structures must be built from the documents. proceedings from those events are perhaps the best starting point for those wishing to keep abreast of publicly documented developments in the ﬁeld.pdf . As a speciﬁc example.15 Finally. Another by Baeza-Yates et al.3 for computing relative frequencies can be used to properly set compression parameters. 15 http://research. Finally. 42. with very frequent queries at the head of the distribution dominating the total number of queries. Additional Readings. Next.com/people/jeff/WSDM09-keynote. Of these three. which are functionally distinct from all of the components discussed above. ACM SIGIR is an annual conference and the most prestigious venue for academic information retrieval research. Application of the value-to-key conversion design pattern (Section 3. the problem most suitable for MapReduce. inverted indexing is nothing but a very large distributed sort and group by operation! We began with a baseline implementation of an inverted indexing algorithm. 
a number of general information retrieval textbooks have been recently published [30. index structures must be accessed and processed in response to user queries to generate search results. the textbook Managing Gigabytes [156] remains an excellent source for index compression techniques. Both of these can be considered ofﬂine problems. We showed how the order inversion design pattern introduced in Section 3. Search engines take advantage of this with cache servers. Finally. We also surveyed various techniques for integer compression.4) addressed the issue by ofﬂoading the task of sorting postings by document id to the MapReduce execution framework. A keynote talk at the WSDM 2009 conference by Jeff Dean revealed a lot of information about the evolution of the Google search architecture. After all. one could use Golomb codes for compressing d-gaps and γ codes for term frequencies. This chapter primarily focused on building inverted indexes. 101]. First. which yield postings lists that are both more compact and faster to process. the documents collection must be gathered (by crawling the web).7. [11] overviews many important issues in distributed retrieval. This last task is an online problem that demands both low latency and high throughput.google. the one by Büttcher et al.


CHAPTER 5

Graph Algorithms

Graphs are ubiquitous in modern society: examples encountered by almost everyone on a daily basis include the hyperlink structure of the web (simply known as the web graph), social networks (manifest in the flow of email, phone call patterns, connections on social networking sites, etc.), and transportation networks (roads, bus routes, flights, etc.). Our very own existence is dependent on an intricate metabolic and regulatory network, which can be characterized as a large, complex graph involving interactions between genes, proteins, and other cellular products. This chapter focuses on graph algorithms in MapReduce. Although most of the content has nothing to do with text processing per se, documents frequently exist in the context of some underlying network, making graph analysis an important component of many text processing applications. Perhaps the best known example is PageRank, a measure of web page quality based on the structure of hyperlinks, which is used in ranking results for web search. As one of the first applications of MapReduce, PageRank exemplifies a large class of graph algorithms that can be concisely captured in the programming model. We will discuss PageRank in detail later in this chapter.

In general, graphs can be characterized by nodes (or vertices) and links (or edges) that connect pairs of nodes.¹ These connections can be directed or undirected. In some graphs, there may be an edge from a node to itself, resulting in a self loop; in others, such edges are disallowed. We assume that both nodes and links may be annotated with additional metadata: as a simple example, in a social network where nodes represent individuals, there might be demographic information (e.g., age, gender, location) attached to the nodes and type information attached to the links (e.g., indicating type of relationship such as “friend” or “spouse”).

¹ Throughout this chapter, we use node interchangeably with vertex and similarly with link and edge.

Mathematicians have always been fascinated with graphs, dating back to Euler’s paper on the Seven Bridges of Königsberg in 1736. Over the past few centuries, graphs have been extensively studied, and today much is known about their properties. Far more than theoretical curiosities, theorems and algorithms on graphs can be applied to solve many real-world problems:

• Graph search and path planning. Search algorithms on graphs are invoked millions of times a day, whenever anyone searches for directions on the web. Similar algorithms are also involved in friend recommendations and expert-finding in social networks. Path planning problems involving everything from network packets to delivery trucks represent another large class of graph search problems.

• Graph clustering. Can a large graph be divided into components that are relatively disjoint (for example, as measured by inter-component links [59])? Among other applications, this task is useful for identifying communities in social networks (of interest to sociologists who wish to understand how human relationships form and evolve) and for partitioning large graphs (of interest to computer scientists who seek to better parallelize graph processing). See [158] for a survey.

• Minimum spanning trees. A minimum spanning tree for a graph G with weighted edges is a tree that contains all vertices of the graph and a subset of edges connecting all the vertices together that minimizes the sum of edge weights. A real-world example of this problem is a telecommunications company that wishes to lay optical fiber to span a number of destinations at the lowest possible cost (where weights denote costs). This approach has also been applied to a wide variety of problems, including social networks and the migration of Polynesian islanders [64].

• Bipartite graph matching. A bipartite graph is one whose vertices can be divided into two disjoint sets. Matching problems on such graphs can be used to model job seekers looking for employment or singles looking for dates.

• Maximum flow. In a weighted directed graph with two special nodes called the source and the sink, the max flow problem involves computing the amount of “traffic” that can be sent from source to sink given various flow capacities defined by edge weights. Transportation companies (airlines, shipping, etc.) and network operators grapple with complex versions of these problems on a daily basis.

• Identifying “special” nodes. There are many ways to define what special means, including metrics based on node in-degree, average distance to other nodes, and relationship to cluster structure. These special nodes are important to investigators attempting to break up terrorist cells, epidemiologists modeling the spread of diseases, advertisers trying to promote products, and many others.

A common feature of these problems is the scale of the datasets on which the algorithms must operate: for example, the hyperlink structure of the web, which contains billions of pages, or social networks that contain hundreds of millions of individuals. Clearly, algorithms that run on a single machine and depend on the entire graph residing in memory are not scalable. We’d like to put MapReduce to work on these challenges.²

² As a side note, Google recently published a short description of a system called Pregel [98], based on Valiant’s Bulk Synchronous Parallel model [148], for large-scale graph algorithms; a longer description is anticipated in a forthcoming paper [99].

This chapter is organized as follows: we begin in Section 5.1 with an introduction to graph representations, and then explore two classic graph algorithms in MapReduce: parallel breadth-first search (Section 5.2) and PageRank (Section 5.3). Before concluding with a summary and pointing out additional readings, Section 5.4 discusses a number of general issues that affect graph processing with MapReduce.

5.1 GRAPH REPRESENTATIONS

One common way to represent a graph is with an adjacency matrix. A graph with n nodes can be represented as an n × n square matrix M, where a value in cell m_ij indicates an edge from node n_i to node n_j. In the case of graphs with weighted edges, the matrix cells contain edge weights; otherwise, each cell contains either a one (indicating an edge), or a zero (indicating none). With undirected graphs, only half the matrix is used (e.g., cells above the diagonal). For graphs that allow self loops (a directed edge from a node to itself), the diagonal might be populated; otherwise, the diagonal remains empty. Figure 5.1 provides an example of a simple directed graph (left) and its adjacency matrix representation (middle).

    Adjacency matrix             Adjacency lists
         n1  n2  n3  n4  n5
    n1    0   1   0   1   0      n1: [n2, n4]
    n2    0   0   1   0   1      n2: [n3, n5]
    n3    0   0   0   1   0      n3: [n4]
    n4    0   0   0   0   1      n4: [n5]
    n5    1   1   1   0   0      n5: [n1, n2, n3]

Figure 5.1: A simple directed graph (left) represented as an adjacency matrix (middle) and with adjacency lists (right).

Although mathematicians prefer the adjacency matrix representation of graphs for easy manipulation with linear algebra, such a representation is far from ideal for computer scientists concerned with efficient algorithmic implementations. Most of the applications discussed in the chapter introduction involve sparse graphs, where the number of actual edges is far smaller than the number of possible edges.³ For example, in a social network of n individuals, there are n(n − 1) possible “friendships” (where n may be on the order of hundreds of millions). However, even the most gregarious will have relatively few friends compared to the size of the network (thousands, perhaps, but still far smaller than hundreds of millions). The same is true for the hyperlink structure of the web: each individual web page links to a minuscule portion of all the pages on the web. In this chapter, we assume processing of sparse graphs, although we will return to this issue in Section 5.4.

³ Unfortunately, there is no precise definition of sparseness agreed upon by all, but one common definition is that a sparse graph has O(n) edges, where n is the number of vertices.

The major problem with an adjacency matrix representation for sparse graphs is its O(n²) space requirement. Furthermore, most of the cells are zero, by definition. As a result, most computational implementations of graph algorithms operate over adjacency lists, in which a node is associated with neighbors that can be reached via outgoing edges. Figure 5.1 also shows the adjacency list representation of the graph under consideration (on the right). For example, since n1 is connected by directed edges to n2 and n4, those two nodes will be on the adjacency list of n1. There are two options for encoding undirected graphs: one could simply encode each edge twice (if n_i and n_j are connected, each appears on each other’s adjacency list). Alternatively, one could order the nodes (arbitrarily or otherwise) and encode edges only on the adjacency list of the node that comes first in the ordering (i.e., if i < j, then n_j is on the adjacency list of n_i, but not the other way around).

Note that certain graph operations are easier on adjacency matrices than on adjacency lists. In the first, operations on incoming links for each node translate into a column scan on the matrix, whereas operations on outgoing links for each node translate into a row scan. With adjacency lists, it is natural to operate on outgoing links, but computing anything that requires knowledge of the incoming links of a node is difficult. However, as we shall see, the shuffle and sort mechanism in MapReduce provides an easy way to group edges by their destination nodes, thus allowing us to compute over incoming edges in the reducer. This property of the execution framework can also be used to invert the edges of a directed graph, by mapping over the nodes’ adjacency lists and emitting key–value pairs with the destination node id as the key and the source node id as the value.⁴

⁴ This technique is used in anchor text inversion, where one gathers the anchor text of hyperlinks pointing to a particular page. It is common practice to enrich a web page’s standard textual representation with all of the anchor texts associated with its incoming hyperlinks (e.g., [107]).

5.2 PARALLEL BREADTH-FIRST SEARCH

One of the most common and well-studied problems in graph theory is the single-source shortest path problem, where the task is to find shortest paths from a source node to all other nodes in the graph (or alternatively, edges can be associated with costs or weights, in which case the task is to compute lowest-cost or lowest-weight paths). Such problems are a staple in undergraduate algorithm courses, where students are taught the solution using Dijkstra’s algorithm. However, this famous algorithm assumes sequential processing. How would we solve this problem in parallel, and more specifically, with MapReduce?

As a refresher and also to serve as a point of comparison, Dijkstra’s algorithm is shown in Figure 5.2, adapted from Cormen, Leiserson, and Rivest’s classic algorithms textbook [41] (often simply known as CLR). The input to the algorithm is a directed, connected graph G = (V, E) represented with adjacency lists, w containing edge distances such that w(u, v) ≥ 0, and the source node s. The algorithm begins by first setting distances to all vertices d[v], v ∈ V to ∞, except for the source node, whose distance to itself is zero. The algorithm maintains Q, a global priority queue of vertices with priorities equal to their distance values d.

    Dijkstra(G, w, s)
        for all vertex v ∈ V do
            d[v] ← ∞
        d[s] ← 0
        Q ← {V}
        while Q ≠ ∅ do
            u ← ExtractMin(Q)
            for all vertex v ∈ u.AdjacencyList do
                if d[v] > d[u] + w(u, v) then
                    d[v] ← d[u] + w(u, v)

Figure 5.2: Pseudo-code for Dijkstra’s algorithm, which is based on maintaining a global priority queue of nodes with priorities equal to their distances from the source node. At each iteration, the algorithm expands the node with the shortest distance and updates distances to all reachable nodes.

Dijkstra’s algorithm operates by iteratively selecting the node with the lowest current distance from the priority queue (initially, this is the source node). At each iteration, the algorithm “expands” that node by traversing the adjacency list of the selected node to see if any of those nodes can be reached with a path of a shorter distance. The algorithm terminates when the priority queue Q is empty, or equivalently, when all nodes have been considered. Note that the algorithm as presented in Figure 5.2 only computes the shortest distances. The actual paths can be recovered by storing “backpointers” for every node indicating a fragment of the shortest path.

A sample trace of the algorithm running on a simple graph is shown in Figure 5.3 (example also adapted from CLR). We start out in (a) with n1 having a distance of zero (since it’s the source) and all other nodes having a distance of ∞. In the first iteration (a), n1 is selected as the node to expand (indicated by the thicker border). After the expansion, we see in (b) that n2 and n3 can be reached at a distance of 10 and 5, respectively. Also, we see in (b) that n3 is the next node selected for expansion. Nodes we have already considered for expansion are shown in black. Expanding n3, we see in (c) that the distance to n2 has decreased because we’ve found a shorter path. The nodes that will be expanded next, in order, are n5, n2, and n4. The algorithm terminates with the end state shown in (f), where we’ve discovered the shortest distance to all nodes.

The key to Dijkstra’s algorithm is the priority queue that maintains a globally sorted list of nodes by current distance. This is not possible in MapReduce, as the programming model does not provide a mechanism for exchanging global data. Instead, we adopt a brute force approach known as parallel breadth-first search. First, as a simplification let us assume that all edges have unit distance (modeling, for example, hyperlinks on the web). This makes the algorithm easier to understand, but we’ll relax this restriction later.

The intuition behind the algorithm is this: the distance of all nodes connected directly to the source node is one; the distance of all nodes directly connected to those is two; and so on. Imagine water rippling away from a rock dropped into a pond: that’s a good image of how parallel breadth-first search works. However, what if there are multiple paths to the same node? Suppose we wish to compute the shortest distance to node n. The shortest path must go through one of the nodes in M that contains an outgoing edge to n: we need to examine all m ∈ M to find m_s, the node with the shortest distance. The shortest distance to n is the distance to m_s plus one.
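As a sequential point of reference for the MapReduce version, Dijkstra’s algorithm can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the textbook’s implementation: it uses the standard-library `heapq` module with lazy deletion of stale entries instead of a priority queue initialized with all vertices, and the edge distances in `GRAPH` are assumed values, chosen only to be consistent with the trace described for Figure 5.3.

```python
import heapq
import math

# Adjacency lists with edge distances: node -> [(neighbor, distance), ...].
# The weights are an assumption, chosen to be consistent with the trace
# described for Figure 5.3.
GRAPH = {
    "n1": [("n2", 10), ("n3", 5)],
    "n2": [("n3", 2), ("n4", 1)],
    "n3": [("n2", 3), ("n4", 9), ("n5", 2)],
    "n4": [("n5", 4)],
    "n5": [("n1", 7), ("n4", 6)],
}


def dijkstra(graph, source):
    """Return (shortest distances, expansion order); distances must be >= 0."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0
    queue = [(0, source)]  # priority queue ordered by current distance
    expanded = []
    done = set()
    while queue:
        d, u = heapq.heappop(queue)
        if u in done:
            continue  # stale queue entry: u was already expanded
        done.add(u)
        expanded.append(u)
        for v, w in graph[u]:
            if dist[v] > d + w:
                dist[v] = d + w  # found a shorter path to v through u
                heapq.heappush(queue, (dist[v], v))
    return dist, expanded


if __name__ == "__main__":
    distances, order = dijkstra(GRAPH, "n1")
    print(distances)
    print(order)
```

With these assumed weights, the nodes are expanded in the order n1, n3, n5, n2, n4, which matches the trace above.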

Figure 5.3: Example of Dijkstra’s algorithm applied to a simple graph with five nodes, with n1 as the source and edge distances as indicated. Parts (a)–(e) show the running of the algorithm at each iteration, with the current distance inside the node. Nodes with thicker borders are those being expanded; nodes that have already been expanded are shown in black.

Pseudo-code for the implementation of the parallel breadth-first search algorithm is provided in Figure 5.4. As with Dijkstra’s algorithm, we assume a connected, directed graph represented as adjacency lists. Distance to each node is directly stored alongside the adjacency list of that node, and initialized to ∞ for all nodes except for the source node. In the pseudo-code, we use n to denote the node id (an integer) and N to denote the node’s corresponding data structure (adjacency list and current distance). The algorithm works by mapping over all nodes and emitting a key-value pair for each neighbor on the node’s adjacency list. The key contains the node id of the neighbor, and the value is the current distance to the node plus one. This says: if we can reach node n with a distance d, then we must be able to reach all the nodes that are connected to n with distance d + 1. After shuffle and sort, reducers will receive keys corresponding to the destination node ids and distances corresponding to all paths leading to that node. The reducer will select the shortest of these distances and then update the distance in the node data structure.

     1: class Mapper
     2:     method Map(nid n, node N)
     3:         d ← N.Distance
     4:         Emit(nid n, N)                          ▷ Pass along graph structure
     5:         for all nodeid m ∈ N.AdjacencyList do
     6:             Emit(nid m, d + 1)                  ▷ Emit distances to reachable nodes

     1: class Reducer
     2:     method Reduce(nid m, [d1, d2, ...])
     3:         dmin ← ∞
     4:         M ← ∅
     5:         for all d ∈ [d1, d2, ...] do
     6:             if IsNode(d) then
     7:                 M ← d                           ▷ Recover graph structure
     8:             else if d < dmin then               ▷ Look for shorter distance
     9:                 dmin ← d
    10:         M.Distance ← Min(dmin, M.Distance)      ▷ Update shortest distance
    11:         Emit(nid m, node M)

Figure 5.4: Pseudo-code for parallel breadth-first search in MapReduce: the mappers emit distances to reachable nodes, while the reducers select the minimum of those distances for each destination node. Each iteration (one MapReduce job) of the algorithm expands the “search frontier” by one hop.

It is apparent that parallel breadth-first search is an iterative algorithm, where each iteration corresponds to a MapReduce job. The first time we run the algorithm, we “discover” all nodes that are connected to the source. The second iteration, we discover all nodes connected to those, and so on. Each iteration of the algorithm expands the “search frontier” by one hop, and, eventually, all nodes will be discovered with their shortest distances (assuming a fully-connected graph). Before we discuss termination of the algorithm, there is one more detail required to make the parallel breadth-first search algorithm work. We need to “pass along” the graph structure from one iteration to the next. This is accomplished by emitting the node data structure itself, with the node id as a key (Figure 5.4, line 4 in the mapper). In the reducer, we must distinguish the node data structure from distance values (Figure 5.4, lines 5–6 in the reducer), and update the minimum distance in the node data structure before emitting it as the final value. The final output is now ready to serve as input to the next iteration.⁵

⁵ Note that in this algorithm we are overloading the value type, which can either be a distance (integer) or a complex data structure representing a node. The best way to achieve this in Hadoop is to create a wrapper object with an indicator variable specifying what the content is.

So how many iterations are necessary to compute the shortest distance to all nodes? The answer is the diameter of the graph, or the greatest distance between any pair of nodes. This number is surprisingly small for many real-world problems: the saying “six degrees of separation” suggests that everyone on the planet is connected to everyone else by at most six steps (the people a person knows are one step away, people that they know are two steps away, etc.). If this is indeed true, then parallel breadth-first search on the global social network would take at most six MapReduce iterations. For more serious academic studies of “small world” phenomena in networks, we refer the reader to a number of publications [2, 61, 62, 152]. In practical terms, we iterate the algorithm until there are no more node distances that are ∞. Since the graph is connected, all nodes are reachable, and since all edge distances are one, all discovered nodes are guaranteed to have the shortest distances (i.e., there is not a shorter path that goes through a node that hasn’t been discovered).

The actual checking of the termination condition must occur outside of MapReduce. Typically, execution of an iterative MapReduce algorithm requires a non-MapReduce “driver” program, which submits a MapReduce job to iterate the algorithm, checks to see if a termination condition has been met, and if not, repeats. Hadoop provides a lightweight API for constructs called “counters”, which, as the name suggests, can be used for counting events that occur during execution, e.g., number of corrupt records, number of times a certain condition is met, or anything that the programmer desires. Counters can be defined to count the number of nodes that have distances of ∞: at the end of the job, the driver program can access the final counter value and check to see if another iteration is necessary.

Finally, as with Dijkstra’s algorithm in the form presented earlier, the parallel breadth-first search algorithm only finds the shortest distances, not the actual shortest paths. However, the path can be straightforwardly recovered. Storing “backpointers” at each node, as with Dijkstra’s algorithm, will work, but may not be efficient since the graph needs to be traversed again to reconstruct the path segments. A simpler approach is to emit paths along with distances in the mapper, so that each node will have its shortest path easily accessible at all times. The additional space requirements for shuffling these data from mappers to reducers are relatively modest, since for the most part paths (i.e., sequence of node ids) are relatively short.

Up until now, we have been assuming that all edges are unit distance. Let us relax that restriction and see what changes are required in the parallel breadth-first search algorithm. The adjacency lists, which were previously lists of node ids, must now encode the edge distances as well. In line 6 of the mapper code in Figure 5.4, instead of emitting d + 1 as the value, we must now emit d + w where w is the edge distance. No other changes to the algorithm are required, but the termination behavior is very different. To illustrate, consider the graph fragment in Figure 5.5, where s is the source node, and in this iteration, we just “discovered” node r for the very first time. Assume for the sake of argument that we’ve already discovered the shortest distance to node p, and that the shortest distance to r so far goes through p. This, however, does not guarantee that we’ve discovered the shortest distance to r, since there may exist a path going through q that we haven’t encountered yet (because it lies outside the search frontier).⁶ However, as the search frontier expands, we’ll eventually cover q and all other nodes along the path from p to q to r, which means that with sufficient iterations, we will discover the shortest distance to r. But how do we know that we’ve found the shortest distance to p? Well, if the shortest path to p lies within the search frontier, we would have already discovered it. And if it doesn’t, the above argument applies. Similarly, we can repeat the same argument for all nodes on the path from s to p. The conclusion is that, with sufficient iterations, we’ll eventually discover all the shortest distances.

⁶ Note that the same argument does not apply to the unit edge distance case: the shortest path cannot lie outside the search frontier since any such path would necessarily be longer.

Figure 5.5: In the single-source shortest path problem with arbitrary edge distances, the shortest path from source s to node r may go outside the current search frontier, in which case we will not find the shortest distance to r until the search frontier expands to cover q.

So exactly how many iterations does “eventually” mean? In the worst case, we might need as many iterations as there are nodes in the graph minus one. In fact, it is not difficult to construct graphs that will elicit this worst-case behavior: Figure 5.6 provides an example, with n1 as the source. The parallel breadth-first search algorithm would not discover that the shortest path from n1 to n6 goes through n3, n4, and n5 until the fifth iteration. Three more iterations are necessary to cover the rest of the graph. Fortunately, for most real-world graphs, such extreme cases are rare, and the number of iterations necessary to discover all shortest distances is quite close to the diameter of the graph, as in the unit edge distance case.

Figure 5.6: A sample graph that elicits worst-case behavior for parallel breadth-first search. Eight iterations are required to discover shortest distances to all nodes from n1.

In practical terms, how do we know when to stop iterating in the case of arbitrary edge distances? The algorithm can terminate when shortest distances at every node no longer change.
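The iteration and termination behavior described above can be simulated on a single machine. The following Python sketch is illustrative only and does not use Hadoop: a dictionary stands in for shuffle and sort, a tagged tuple stands in for the wrapper object that distinguishes node structures from distances, and an integer returned by `run_iteration` stands in for a counter. The chain graph built by `make_graph` is a hypothetical example in the spirit of Figure 5.6 (a costly direct edge from n1 to n6 competing with a cheaper path that needs more hops); all function and field names are our own.

```python
import math
from collections import defaultdict


def make_graph():
    """Hypothetical graph: node id -> {"dist": distance, "adj": [(nbr, w)]}."""
    edges = {
        "n1": [("n2", 1), ("n6", 10)],
        "n2": [("n3", 1)], "n3": [("n4", 1)], "n4": [("n5", 1)],
        "n5": [("n6", 1)], "n6": [("n7", 1)], "n7": [("n8", 1)],
        "n8": [("n9", 1)], "n9": [],
    }
    return {n: {"dist": 0 if n == "n1" else math.inf, "adj": adj}
            for n, adj in edges.items()}


def mapper(nid, node):
    yield nid, ("node", node)                 # pass along graph structure
    for m, w in node["adj"]:
        yield m, ("dist", node["dist"] + w)   # distance to a reachable node


def reducer(nid, values):
    node, dmin = None, math.inf
    for kind, value in values:
        if kind == "node":
            node = value                      # recover graph structure
        elif value < dmin:
            dmin = value                      # look for shorter distance
    updated = dmin < node["dist"]
    new_node = {"dist": min(dmin, node["dist"]), "adj": node["adj"]}
    return nid, new_node, updated


def run_iteration(graph):
    """One simulated MapReduce job; returns (new graph, number of updates)."""
    groups = defaultdict(list)                # stands in for shuffle and sort
    for nid, node in graph.items():
        for key, value in mapper(nid, node):
            groups[key].append(value)
    new_graph, updates = {}, 0
    for nid, values in groups.items():
        key, node, updated = reducer(nid, values)
        new_graph[key] = node
        updates += int(updated)               # plays the role of a counter
    return new_graph, updates


def driver(graph):
    """Resubmit jobs until an iteration changes no distance."""
    iterations = 0
    while True:
        graph, updates = run_iteration(graph)
        iterations += 1
        if updates == 0:
            return graph, iterations


if __name__ == "__main__":
    final, iterations = driver(make_graph())
    print({n: node["dist"] for n, node in final.items()}, iterations)
```

On this graph the driver runs eight iterations that change at least one distance, followed by a ninth that changes nothing and thereby signals termination.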

Once again, we can use counters to keep track of such events. Every time we encounter a shorter distance in the reducer, we increment a counter. At the end of each MapReduce iteration, the driver program reads the counter value and determines if another iteration is necessary.

Compared to Dijkstra’s algorithm on a single processor, parallel breadth-first search in MapReduce can be characterized as a brute force approach that “wastes” a lot of time performing computations whose results are discarded. At each iteration, the algorithm attempts to recompute distances to all nodes, but in reality only useful work is done along the search frontier: inside the search frontier, the algorithm is simply repeating previous computations (see footnote 7). Outside the search frontier, the algorithm hasn’t discovered any paths to nodes there yet, so no meaningful work is done. Dijkstra’s algorithm, on the other hand, is far more efficient. Every time a node is explored, we’re guaranteed to have already found the shortest path to it. However, this is made possible by maintaining a global data structure (a priority queue) that holds nodes sorted by distance; this is not possible in MapReduce because the programming model does not provide support for global data that is mutable and accessible by the mappers and reducers. These inefficiencies represent the cost of parallelization.

The parallel breadth-first search algorithm is instructive in that it represents the prototypical structure of a large class of graph algorithms in MapReduce. They share the following characteristics:

• The graph structure is represented with adjacency lists, which are part of some larger node data structure that may contain additional information (variables to store intermediate output, features of the nodes). In many cases, features are attached to edges as well (e.g., edge weights).

• The MapReduce algorithm maps over the node data structures and performs a computation that is a function of features of the node, intermediate state attached to each node, and features of the adjacency list (outgoing edges and their features). In other words, computations can only involve a node’s internal state and its local graph structure. The results of these computations are emitted as values, keyed with the node ids of the neighbors (i.e., those nodes on the adjacency lists). Conceptually, we can think of this as “passing” the results of the computation along outgoing edges. In the reducer, the algorithm receives all partial results that have the same destination node, and performs another computation (usually, some form of aggregation).

• In addition to computations, the graph itself is also passed from the mapper to the reducer. In the reducer, the data structure corresponding to each node is updated and written back to disk.

• Graph algorithms in MapReduce are generally iterative, where the output of the previous iteration serves as input to the next iteration. The process is controlled by a non-MapReduce driver program that checks for termination.

7 Unless the algorithm discovers an instance of the situation described in Figure 5.5, in which case, updated distances will propagate inside the search frontier.
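The characteristics listed above can be illustrated with a minimal single-process sketch of parallel breadth-first search with arbitrary edge distances. It is not a Hadoop program: a Python dictionary stands in for the node records on disk, a `defaultdict` stands in for the shuffle and sort phase, and the names `bfs_map`, `bfs_reduce`, `run_iteration`, and `shortest_paths` are hypothetical. The sketch also assumes that every node has a record in the input.

```python
import math
from collections import defaultdict

# Hypothetical node record: (distance, adjacency), where adjacency is a
# list of (neighbor_id, edge_weight) pairs.

def bfs_map(nid, node):
    dist, adj = node
    yield nid, ("node", node)              # pass along the graph structure
    if dist < math.inf:
        for m, w in adj:
            yield m, ("dist", dist + w)    # tentative distance to a neighbor

def bfs_reduce(nid, values):
    node, best = None, math.inf
    for kind, v in values:
        if kind == "node":
            node = v                       # recover the graph structure
        else:
            best = min(best, v)            # Min over incoming distances
    old_dist, adj = node
    improved = best < old_dist
    return (min(old_dist, best), adj), improved

def run_iteration(graph):
    groups = defaultdict(list)             # stands in for shuffle and sort
    for nid, node in graph.items():
        for key, value in bfs_map(nid, node):
            groups[key].append(value)
    new_graph, counter = {}, 0
    for nid, values in groups.items():
        new_graph[nid], improved = bfs_reduce(nid, values)
        counter += improved                # "shorter distance found" counter
    return new_graph, counter

def shortest_paths(graph):
    iterations = 0
    while True:                            # the non-MapReduce driver loop
        graph, counter = run_iteration(graph)
        iterations += 1
        if counter == 0:                   # no distance changed: terminate
            return {n: d for n, (d, _) in graph.items()}, iterations
```

The driver stops after the first iteration in which the counter stays at zero, so the final pass does no useful work; that extra pass is part of the cost of detecting termination this way.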

For parallel breadth-first search, the mapper computation is the current distance plus edge distance (emitting distances to neighbors), while the reducer computation is the Min function (selecting the shortest path). As we will see in the next section, the MapReduce algorithm for PageRank works in much the same way.

5.3 PAGERANK

PageRank [117] is a measure of web page quality based on the structure of the hyperlink graph. Although it is only one of thousands of features that are taken into account in Google’s search algorithm, it is perhaps one of the best known and most studied.

A vivid way to illustrate PageRank is to imagine a random web surfer: the surfer visits a page, randomly clicks a link on that page, and repeats ad infinitum. PageRank is a measure of how frequently a page would be encountered by our tireless web surfer. More precisely, PageRank is a probability distribution over nodes in the graph representing the likelihood that a random walk over the link structure will arrive at a particular node. Nodes that have high in-degrees tend to have high PageRank values, as well as nodes that are linked to by other nodes with high PageRank values. This behavior makes intuitive sense: if PageRank is a measure of page quality, we would expect high-quality pages to contain “endorsements” from many other pages in the form of hyperlinks. Similarly, if a high-quality page links to another page, then the second page is likely to be high quality also. PageRank represents one particular approach to inferring the quality of a web page based on hyperlink structure; two other popular algorithms, not covered here, are SALSA [88] and HITS [84] (also known as “hubs and authorities”).

The complete formulation of PageRank includes an additional component. As it turns out, our web surfer doesn’t just randomly click links. Before the surfer decides where to go next, a biased coin is flipped: heads, the surfer clicks on a random link on the page as usual. Tails, however, the surfer ignores the links on the page and randomly “jumps” or “teleports” to a completely different page.

But enough about random web surfing. Formally, the PageRank P of a page n is defined as follows:

$$P(n) = \alpha \left( \frac{1}{|G|} \right) + (1 - \alpha) \sum_{m \in L(n)} \frac{P(m)}{C(m)} \qquad (5.1)$$

where |G| is the total number of nodes (pages) in the graph, α is the random jump factor, L(n) is the set of pages that link to n, and C(m) is the out-degree of node m (the number of links on page m). The random jump factor α is sometimes called the “teleportation” factor; alternatively, (1 − α) is referred to as the “damping” factor.

Let us break down each component of the formula in detail. First, note that PageRank is defined recursively, which gives rise to an iterative algorithm we will detail in a bit. A web page n receives PageRank “contributions” from all pages that link to it, L(n). Let us consider a page m from the set of pages L(n): a random surfer at m will arrive at n with probability 1/C(m) since a link is

selected at random from all outgoing links. Since the PageRank value of m is the probability that the random surfer will be at m, the probability of arriving at n from m is P(m)/C(m). To compute the PageRank of n, we need to sum contributions from all pages that link to n. This is the summation in the second half of the equation. However, we also need to take into account the random jump: there is a 1/|G| chance of landing at any particular page, where |G| is the number of nodes in the graph. Of course, the two contributions need to be combined: with probability α the random surfer executes a random jump, and with probability 1 − α the random surfer follows a hyperlink.

Note that PageRank assumes a community of honest users who are not trying to “game” the measure. This is, of course, not true in the real world, where an adversarial relationship exists between search engine companies and a host of other organizations and individuals (marketers, spammers, activists, etc.) who are trying to manipulate search results: to promote a cause, product, or service, or in some cases, to trap and intentionally deceive users (see, for example, [12, 63]). A simple example is a so-called “spider trap”, an infinite chain of pages (e.g., generated by CGI) that all link to a single page (thereby artificially inflating its PageRank). For this reason, PageRank is only one of thousands of features used in ranking web pages.

The fact that PageRank is recursively defined translates into an iterative algorithm which is quite similar in basic structure to parallel breadth-first search. We start by presenting an informal sketch. At the beginning of each iteration, a node passes its PageRank contributions to other nodes that it is connected to. Since PageRank is a probability distribution, we can think of this as spreading probability mass to neighbors via outgoing links. To conclude the iteration, each node sums up all PageRank contributions that have been passed to it and computes an updated PageRank score. We can think of this as gathering probability mass passed to a node via its incoming links. This algorithm iterates until PageRank values don’t change anymore.

Figure 5.7 shows a toy example that illustrates two iterations of the algorithm. As a simplification, we ignore the random jump factor for now (i.e., α = 0) and further assume that there are no dangling nodes (i.e., nodes with no outgoing edges). The algorithm begins by initializing a uniform distribution of PageRank values across nodes. In the beginning of the first iteration (top, left), partial PageRank contributions are sent from each node to its neighbors connected via outgoing links. For example, n1 sends 0.1 PageRank mass to n2 and 0.1 PageRank mass to n4. This makes sense in terms of the random surfer model: if the surfer is at n1 with a probability of 0.2, then the surfer could end up either in n2 or n4 with a probability of 0.1 each. The same occurs for all the other nodes in the graph: note that n5 must split its PageRank mass three ways, since it has three neighbors, and n4 receives all the mass belonging to n3 because n3 isn’t connected to any other node. The end of the first iteration is shown in the top right: each node sums up PageRank contributions from its neighbors. Note that since n1 has only one incoming link, from n5, its updated PageRank value is smaller than before, i.e., it “passed along” more PageRank mass than it received. The exact same process repeats, and the second iteration in our toy example is illustrated by the bottom two graphs. At the beginning of each iteration, the PageRank values of all nodes sum to one. PageRank mass is

preserved by the algorithm, guaranteeing that we continue to have a valid probability distribution at the end of each iteration.

[Figure 5.7 diagram not reproduced. Iteration 1: every node starts at 0.2; updated values are n1 (0.066), n2 (0.166), n3 (0.166), n4 (0.3), n5 (0.3). Iteration 2: updated values are n1 (0.1), n2 (0.133), n3 (0.183), n4 (0.2), n5 (0.383).]

Figure 5.7: PageRank toy example showing two iterations, top and bottom. Left graphs show PageRank values at the beginning of each iteration and how much PageRank mass is passed to each neighbor. Right graphs show updated PageRank values at the end of each iteration.

Pseudo-code of the MapReduce PageRank algorithm is shown in Figure 5.8; it is simplified in that we continue to ignore the random jump factor and assume no dangling nodes (complications that we will return to later). An illustration of the running algorithm is shown in Figure 5.9 for the first iteration of the toy graph in Figure 5.7. The algorithm maps over the nodes, and for each node computes how much PageRank mass needs to be distributed to its neighbors (i.e., nodes on the adjacency list). Each piece of the PageRank mass is emitted as the value, keyed by the node ids of the neighbors. Conceptually, we can think of this as passing PageRank mass along outgoing edges.

In the shuffle and sort phase, the MapReduce execution framework groups values (pieces of PageRank mass) passed along the graph edges by destination node (i.e., all edges that point to the same node). In the reducer, PageRank mass contributions from all incoming edges are summed to arrive at the updated PageRank value for each node. As with the parallel breadth-first search algorithm, the graph structure itself must be passed from iteration to iteration. Each node data structure is emitted in the mapper and written back out to disk in the reducer. All PageRank mass

emitted by the mappers is accounted for in the reducer: since we begin with the sum of PageRank values across all nodes equal to one, the sum of all the updated PageRank values should remain a valid probability distribution.

    class Mapper
      method Map(nid n, node N)
        p ← N.PageRank / |N.AdjacencyList|
        Emit(nid n, N)                          ▷ Pass along graph structure
        for all nodeid m ∈ N.AdjacencyList do
          Emit(nid m, p)                        ▷ Pass PageRank mass to neighbors

    class Reducer
      method Reduce(nid m, [p1, p2, ...])
        M ← ∅
        s ← 0
        for all p ∈ [p1, p2, ...] do
          if IsNode(p) then
            M ← p                               ▷ Recover graph structure
          else
            s ← s + p                           ▷ Sum incoming PageRank contributions
        M.PageRank ← s
        Emit(nid m, node M)

Figure 5.8: Pseudo-code for PageRank in MapReduce (leaving aside dangling nodes and the random jump factor). In the map phase we evenly divide up each node’s PageRank mass and pass each piece along outgoing edges to neighbors. In the reduce phase PageRank contributions are summed up at each destination node. Each MapReduce job corresponds to one iteration of the algorithm.

Having discussed the simplified PageRank algorithm in MapReduce, let us now take into account the random jump factor and dangling nodes: as it turns out both are treated similarly. Dangling nodes are nodes in the graph that have no outgoing edges, i.e., their adjacency lists are empty. In the hyperlink graph of the web, these might correspond to pages in a crawl that have not been downloaded yet. If we simply run the algorithm in Figure 5.8 on graphs with dangling nodes, the total PageRank mass will not be conserved, since no key-value pairs will be emitted when a dangling node is encountered in the mappers.

The proper treatment of PageRank mass “lost” at the dangling nodes is to redistribute it across all nodes in the graph evenly (cf. [22]). There are many ways to determine the missing PageRank mass. One simple approach is by instrumenting the algorithm in Figure 5.8 with counters: whenever the mapper processes a node with an empty adjacency list, it keeps track of the node’s PageRank value in the counter. At the end of the iteration, we can access the counter to find out how much

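[Figure 5.9 diagram not reproduced. Input and output node records: n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]. The map phase emits PageRank mass keyed by n2, n4, n3, n5, n4, n5, n1, n2, n3; after grouping, the reduce phase receives the keys n1, n2, n2, n3, n3, n4, n4, n5, n5.]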

Figure 5.9: Illustration of the MapReduce PageRank algorithm corresponding to the first iteration in Figure 5.7. The size of each box is proportional to its PageRank value. During the map phase, PageRank mass is distributed evenly to nodes on each node’s adjacency list (shown at the very top). Intermediate values are keyed by node (shown inside the boxes). In the reduce phase, all partial PageRank contributions are summed together to arrive at updated values.

PageRank mass was lost at the dangling nodes (see footnote 8). Another approach is to reserve a special key for storing PageRank mass from dangling nodes. When the mapper encounters a dangling node, its PageRank mass is emitted with the special key; the reducer must be modified to contain special logic for handling the missing PageRank mass. Yet another approach is to write out the missing PageRank mass as “side data” for each map task (using the in-mapper combining technique for aggregation); a final pass in the driver program is needed to sum the mass across all map tasks. Either way, we arrive at the amount of PageRank mass lost at the dangling nodes, which must then be redistributed evenly across all nodes.

This redistribution process can be accomplished by mapping over all nodes again. At the same time, we can take into account the random jump factor. For each node, its current PageRank value p is updated to the final PageRank value p' according to the following formula:

$$p' = \alpha \left( \frac{1}{|G|} \right) + (1 - \alpha) \left( \frac{m}{|G|} + p \right) \qquad (5.2)$$

where m is the missing PageRank mass, and |G| is the number of nodes in the entire graph. We add the PageRank mass from link traversal (p, computed from before) to the share of the lost PageRank mass that is distributed to each node (m/|G|). Finally, we take into account the random jump factor:

8 In Hadoop, counters are 8-byte integers: a simple workaround is to multiply PageRank values by a large constant, and then cast as an integer.


with probability α the random surfer arrives via jumping, and with probability 1 − α the random surfer arrives via incoming links. Note that this MapReduce job requires no reducers. Putting everything together, one iteration of PageRank requires two MapReduce jobs: the ﬁrst to distribute PageRank mass along graph edges, and the second to take care of dangling nodes and the random jump factor. At the end of each iteration, we end up with exactly the same data structure as the beginning, which is a requirement for the iterative algorithm to work. Also, the PageRank values of all nodes sum up to one, which ensures a valid probability distribution. Typically, PageRank is iterated until convergence, i.e., when the PageRank values of nodes no longer change (within some tolerance, to take into account, for example, ﬂoating point precision errors). Therefore, at the end of each iteration, the PageRank driver program must check to see if convergence has been reached. Alternative stopping criteria include running a ﬁxed number of iterations (useful if one wishes to bound algorithm running time) or stopping when the ranks of PageRank values no longer change. The latter is useful for some applications that only care about comparing the PageRank of two arbitrary pages and do not need the actual PageRank values. Rank stability is obtained faster than the actual convergence of values. In absolute terms, how many iterations are necessary for PageRank to converge? This is a difﬁcult question to precisely answer since it depends on many factors, but generally, fewer than one might expect. In the original PageRank paper [117], convergence on a graph with 322 million edges was reached in 52 iterations (see also Bianchini et al. [22] for additional discussion). 
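As a rough illustration of the two jobs per iteration and of the driver’s convergence check, the following single-process sketch simulates them with plain Python dictionaries. It is not Hadoop code: the function names (`job1_distribute`, `job2_adjust`, `pagerank`) are hypothetical, the default α = 0.15 and the tolerance are assumed values chosen only for the example, and the graph is a dictionary from node id to adjacency list.

```python
from collections import defaultdict

def job1_distribute(graph, rank):
    """First job: pass PageRank mass along outgoing edges and sum it."""
    incoming = defaultdict(float)          # stands in for shuffle + reduce sum
    missing = 0.0                          # mass "lost" at dangling nodes
    for n, adj in graph.items():
        if not adj:
            missing += rank[n]             # e.g., tracked with a counter
            continue
        share = rank[n] / len(adj)
        for m in adj:
            incoming[m] += share
    return {n: incoming[n] for n in graph}, missing

def job2_adjust(p, missing, alpha):
    """Second, map-only job: apply Equation 5.2 to every node."""
    size = len(p)
    return {n: alpha / size + (1 - alpha) * (missing / size + v)
            for n, v in p.items()}

def pagerank(graph, alpha=0.15, tol=1e-10, max_iters=1000):
    rank = {n: 1.0 / len(graph) for n in graph}   # uniform initialization
    for i in range(1, max_iters + 1):             # driver loop
        p, missing = job1_distribute(graph, rank)
        new_rank = job2_adjust(p, missing, alpha)
        delta = max(abs(new_rank[n] - rank[n]) for n in graph)
        rank = new_rank
        if delta < tol:                           # values no longer change
            break
    return rank, i
```

On the toy graph of Figure 5.7 (adjacency lists n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]) with uniform starting values of 0.2, a single call to `job1_distribute` yields the values shown at the end of the first iteration: roughly 0.066, 0.166, 0.166, 0.3, and 0.3. Checking the largest per-node change is only one possible stopping rule; the fixed iteration count is kept as a safety bound.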
On today’s web, the answer is not very meaningful due to the adversarial nature of web search as previously discussed—the web is full of spam and populated with sites that are actively trying to “game” PageRank and related hyperlink-based metrics. As a result, running PageRank in its unmodiﬁed form presented here would yield unexpected and undesirable results. Of course, strategies developed by web search companies to combat link spam are proprietary (and closely guarded secrets, for obvious reasons)—but undoubtedly these algorithmic modiﬁcations impact convergence behavior. A full discussion of the escalating “arms race” between search engine companies and those that seek to promote their sites is beyond the scope of this book.9

5.4 ISSUES WITH GRAPH PROCESSING

The biggest difference between MapReduce graph algorithms and single-machine graph algorithms is that with the latter, it is usually possible to maintain global data structures in memory for fast, random access. For example, Dijkstra’s algorithm uses a global priority queue that guides the expansion of nodes. This, of course, is not possible with MapReduce—the programming model does not provide any built-in mechanism for communicating global state. Since the most natural representation of large sparse graphs is with adjacency lists, communication can only occur from a node to the

9 For the interested reader, the proceedings of a workshop series on Adversarial Information Retrieval (AIRWeb) provide great starting points into the literature.


nodes it links to, or to a node from nodes linked to it; in other words, passing information is only possible within the local graph structure (see footnote 10). This restriction gives rise to the structure of many graph algorithms in MapReduce: local computation is performed on each node, the results of which are “passed” to its neighbors. With multiple iterations, convergence on the global graph is possible. The passing of partial results along a graph edge is accomplished by the shuffling and sorting provided by the MapReduce execution framework. The amount of intermediate data generated is on the order of the number of edges, which explains why all the algorithms we have discussed assume sparse graphs. For dense graphs, MapReduce running time would be dominated by copying intermediate data across the network, which in the worst case is O(n^2) in the number of nodes in the graph. Since MapReduce clusters are designed around commodity networks (e.g., gigabit Ethernet), MapReduce algorithms are often impractical on large, dense graphs. Combiners and the in-mapper combining pattern described in Section 3.1 can be used to decrease the running time of graph iterations. It is straightforward to use combiners for both parallel breadth-first search and PageRank since Min and sum, used in the two algorithms, respectively, are both associative and commutative. However, combiners are only effective to the extent that there are opportunities for partial aggregation: unless there are nodes pointed to by multiple nodes being processed by an individual map task, combiners are not very useful. This implies that it would be desirable to partition large graphs into smaller components where there are many intra-component links and fewer inter-component links. This way, we can arrange the data such that nodes in the same component are handled by the same map task, thus maximizing opportunities for combiners to perform local aggregation. Unfortunately, this sometimes creates a chicken-and-egg problem.
It would be desirable to partition a large graph to facilitate efficient processing by MapReduce. But the graph may be so large that we can’t partition it except with MapReduce algorithms! Fortunately, in many cases there are simple solutions around this problem in the form of “cheap” partitioning heuristics based on reordering the data [106]. For example, in a social network, we might sort nodes representing users by zip code, as opposed to by last name, based on the observation that friends tend to live close to each other. Sorting by an even more cohesive property such as school would be even better (if available): the probability of any two random students from the same school knowing each other is much higher than two random students from different schools. Another good example is to partition the web graph by the language of the page (since pages in one language tend to link mostly to other pages in that language) or by domain name (since intra-domain links are typically much denser than inter-domain links). Re-sorting records using MapReduce is both easy to do and a relatively cheap operation. However, whether the efficiencies gained by this crude form of partitioning are worth the extra time taken in performing the re-sort is an empirical question that will depend on the actual graph structure and algorithm.
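To make the link between partitioning and local aggregation concrete, the sketch below counts how many PageRank-mass key-value pairs would leave the map tasks when each task sums contributions per destination node before emitting (as a combiner or in-mapper combining would). The function name `pairs_after_combining`, the six-node graph, and both assignments of nodes to map tasks are made up for illustration; the node records themselves, which are also emitted, are not counted.

```python
def pairs_after_combining(graph, partitions):
    """Count key-value pairs leaving the map tasks when each task
    aggregates (sums) contributions per destination node locally."""
    total = 0
    for task_nodes in partitions:
        destinations = set()
        for n in task_nodes:
            destinations.update(graph[n])   # one summed value per destination
        total += len(destinations)
    return total

# Two tightly linked groups joined by a single cross link (made-up data).
graph = {
    "a1": ["a2", "a3"], "a2": ["a1", "a3"], "a3": ["a1", "a2", "b1"],
    "b1": ["b2", "b3"], "b2": ["b1", "b3"], "b3": ["b1", "b2"],
}
by_group = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]   # cohesive partitioning
mixed = [["a1", "b2", "a3"], ["b1", "a2", "b3"]]      # arbitrary partitioning

edges = sum(len(adj) for adj in graph.values())
print(edges, pairs_after_combining(graph, by_group),
      pairs_after_combining(graph, mixed))
```

In this small example the 13 edges shrink to 7 intermediate pairs when each group is handled by its own map task, but only to 10 when the groups are mixed. Whether such savings justify the cost of re-sorting a real graph remains the empirical question raised above.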

10 Of course, it is perfectly reasonable to compute derived graph structures in a pre-processing step. For example, if one wishes to propagate information from a node to all nodes that are within two links, one could process graph G to derive graph G', where there would exist a link from node ni to nj if nj was reachable within two link traversals of ni in the original graph G.

Finally, there is a practical consideration to keep in mind when implementing graph algorithms that estimate probability distributions over nodes (such as PageRank). For large graphs, the probability of any particular node is often so small that it underflows standard floating point representations. A very common solution to this problem is to represent probabilities using their logarithms. When probabilities are stored as logs, the product of two values is simply their sum. However, addition of probabilities is also necessary, for example, when summing PageRank contributions for a node. This can be implemented with reasonable precision as follows:

$$a \oplus b = \begin{cases} b + \log(1 + e^{a-b}) & a < b \\ a + \log(1 + e^{b-a}) & a \geq b \end{cases}$$

Furthermore, many math libraries include a log1p function which computes log(1 + x) with higher precision than the naïve implementation would have when x is very small (as is often the case when working with probabilities). Its use may further improve the accuracy of implementations that use log probabilities.

5.5 SUMMARY AND ADDITIONAL READINGS

This chapter covers graph algorithms in MapReduce, discussing in detail parallel breadth-first search and PageRank. Both are instances of a large class of iterative algorithms that share the following characteristics:

• The graph structure is represented with adjacency lists.

• Algorithms map over nodes and pass partial results to nodes on their adjacency lists. Partial results are aggregated for each node in the reducer.

• The graph structure itself is passed from the mapper to the reducer, such that the output is in the same form as the input.

• Algorithms are iterative and under the control of a non-MapReduce driver program, which checks for termination at the end of each iteration.

The MapReduce programming model does not provide a mechanism to maintain global data structures accessible and mutable by all the mappers and reducers (see footnote 11). One implication of this is that communication between pairs of arbitrary nodes is difficult to accomplish. Instead, information typically propagates along graph edges, which gives rise to the structure of algorithms discussed above.

Additional Readings. The ubiquity of large graphs translates into substantial interest in scalable graph algorithms using MapReduce in industry, academia, and beyond. There is, of course, much beyond what has been covered in this chapter. For additional material, we refer readers to the following: Kang et al. [80] presented an approach to estimating the diameter of large graphs using MapReduce and a library for graph mining [81]. Cohen [39] discussed a number of algorithms for processing undirected graphs, with social network analysis in mind. Rao and Yarowsky [128] described an implementation of label propagation, a standard algorithm for semi-supervised machine learning, on graphs derived from textual data. Schatz [132] tackled the problem of DNA sequence alignment and assembly with graph algorithms in MapReduce. Finally, it is easy to forget that parallel graph algorithms have been studied by computer scientists for several decades, particularly in the PRAM model [60, 77]. It is not clear, however, to what extent well-known PRAM algorithms translate naturally into the MapReduce framework.

11 However, maintaining globally-synchronized state may be possible with the assistance of other tools (e.g., a distributed database).
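The log-space addition of probabilities described earlier in this chapter can be sketched in a few lines of Python using the standard library’s `math.log1p`. The function name `log_add` is hypothetical, and the explicit handling of negative infinity (the logarithm of a zero probability) is an added assumption that the text does not discuss.

```python
import math

def log_add(a, b):
    """Return log(exp(a) + exp(b)) for log-probabilities a and b."""
    if a == -math.inf:                      # log of probability zero
        return b
    if b == -math.inf:
        return a
    if a < b:
        return b + math.log1p(math.exp(a - b))
    return a + math.log1p(math.exp(b - a))
```

Because the exponent passed to `math.exp` is never positive, the intermediate value stays between 0 and 1 even when `exp(a)` and `exp(b)` would each underflow to zero on their own, as happens for log-probabilities around -1000.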


CHAPTER 6

EM Algorithms for Text Processing

Until the end of the 1980s, text processing systems tended to rely on large numbers of manually written rules to analyze, annotate, and transform text input, usually in a deterministic way. This rule-based approach can be appealing: a system’s behavior can generally be understood and predicted precisely, and, when errors surface, they can be corrected by writing new rules or refining old ones. However, rule-based systems suffer from a number of serious problems. They are brittle with respect to the natural variation found in language, and developing systems that can deal with inputs from diverse domains is very labor intensive. Furthermore, when these systems fail, they often do so catastrophically, unable to offer even a “best guess” as to what the desired analysis of the input might be.

In the last 20 years, the rule-based approach has largely been abandoned in favor of more data-driven methods, where the “rules” for processing the input are inferred automatically from large corpora of examples, called training data. The basic strategy of the data-driven approach is to start with a processing algorithm capable of capturing how any instance of the kinds of inputs (e.g., sentences or emails) can relate to any instance of the kinds of outputs that the final system should produce (e.g., the syntactic structure of the sentence or a classification of the email as spam). At this stage, the system can be thought of as having the potential to produce any output for any input, but they are not distinguished in any way. Next, a learning algorithm is applied which refines this process based on the training data, generally attempting to make the model perform as well as possible at predicting the examples in the training data. The learning process, which often involves iterative algorithms, typically consists of activities like ranking rules, instantiating the content of rule templates, or determining parameter settings for a given model. This is known as machine learning, an active area of research.

Data-driven approaches have turned out to have several benefits over rule-based approaches to system development. First, since data-driven systems can be trained using examples of the kind that they will eventually be used to process, they tend to deal with the complexities found in real data more robustly than rule-based systems do. Second, developing training data tends to be far less expensive than developing rules. For some applications, significant quantities of training data may even exist for independent reasons (e.g., translations of text into multiple languages are created by authors wishing to reach an audience speaking different languages, not because they are generating training data for a data-driven machine translation system). These advantages come at the cost of

systems that often behave internally quite differently than a human-engineered system. As a result, correcting errors that the trained system makes can be quite challenging.

Data-driven information processing systems can be constructed using a variety of mathematical techniques, but in this chapter we focus on statistical models, which probabilistically relate inputs from an input set X (e.g., sentences, documents, etc.), which are always observable, to annotations from a set Y, which is the space of possible annotations or analyses that the system should predict. This model may take the form of either a joint model Pr(x, y) which assigns a probability to every pair ⟨x, y⟩ ∈ X × Y or a conditional model Pr(y|x), which assigns a probability to every y ∈ Y, given an x ∈ X. For example, to create a statistical spam detection system, we might have Y = {Spam, NotSpam} and X be the set of all possible email messages. For machine translation, X might be the set of Arabic sentences and Y the set of English sentences (see footnote 1).

There are three closely related, but distinct, challenges in statistical text processing. The first is model selection. This entails selecting a representation of a joint or conditional distribution over the desired X and Y. For a problem where X and Y are very small, one could imagine representing these probabilities in lookup tables. However, for something like email classification or machine translation, where the model space is infinite, the probabilities cannot be represented directly, and must be computed algorithmically. As an example of such models, we introduce hidden Markov models (HMMs), which define a joint distribution over sequences of inputs and sequences of annotations. The second challenge is parameter estimation or learning, which involves the application of an optimization algorithm and training criterion to select the parameters of the model to optimize the model’s performance (with respect to the given training criterion) on the training data (see footnote 2). The parameters of a statistical model are the values used to compute the probability of some event described by the model. In this chapter we will focus on one particularly simple training criterion for parameter estimation, maximum likelihood estimation, which says to select the parameters that make the training data most probable under the model, and one learning algorithm that attempts to meet this criterion, called expectation maximization (EM).

The final challenge for statistical modeling is the problem of decoding, or, given some x, using the model to select an annotation y. One very common strategy is to select y according to the following criterion:

$$y^* = \arg\max_{y \in Y} \Pr(y \mid x)$$

In a conditional (or direct) model, this is a straightforward search for the best y under the model. In a joint model, the search is also straightforward, on account of the definition of conditional probability:

$$y^* = \arg\max_{y \in Y} \Pr(y \mid x) = \arg\max_{y \in Y} \frac{\Pr(x, y)}{\sum_{y'} \Pr(x, y')} = \arg\max_{y \in Y} \Pr(x, y)$$

1 In this chapter, we will consider discrete models only. They tend to be sufficient for text processing, and their presentation is simpler than models with continuous densities. It should be kept in mind that the sets X and Y may still be countably infinite.

2 We restrict our discussion in this chapter to models with finite numbers of parameters and where the learning process refers to setting those parameters. Inference in and learning of so-called nonparametric models, which have an infinite number of parameters and have become important statistical models for text processing in recent years, is beyond the scope of this chapter.
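The equivalence between decoding with the conditional distribution and decoding with the joint distribution can be checked on a toy lookup-table model. Everything in the sketch below is illustrative: the two messages, the probabilities in `joint`, and the function names are made up, and a real spam model would compute these probabilities algorithmically rather than store them in a table.

```python
# Made-up joint probabilities Pr(x, y) over a tiny input set X.
joint = {
    ("win money now", "Spam"): 0.20, ("win money now", "NotSpam"): 0.02,
    ("meeting at noon", "Spam"): 0.03, ("meeting at noon", "NotSpam"): 0.75,
}
LABELS = ["Spam", "NotSpam"]

def conditional(x, y):
    # Pr(y | x) = Pr(x, y) / sum over y' of Pr(x, y')
    return joint[(x, y)] / sum(joint[(x, y2)] for y2 in LABELS)

def decode_joint(x):
    return max(LABELS, key=lambda y: joint[(x, y)])

def decode_conditional(x):
    return max(LABELS, key=lambda y: conditional(x, y))
```

For a fixed x the denominator is the same for every candidate y, so dividing by it cannot change which label attains the maximum; both decoders therefore return the same annotation.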

The specific form that the search takes will depend on how the model is represented. Our focus in this chapter will primarily be on the second problem: learning parameters for models, but we will touch on the third problem as well.

Machine learning is often categorized as either supervised or unsupervised. Supervised learning of statistical models simply means that the model parameters are estimated from training data consisting of pairs of inputs and annotations, that is Z = ⟨⟨x1, y1⟩, ⟨x2, y2⟩, ...⟩, where ⟨xi, yi⟩ ∈ X × Y and yi is the gold standard (i.e., correct) annotation of xi. While supervised models often attain quite good performance, they are often uneconomical to use, since the training data requires each object that is to be classified (to pick a specific task), xi, to be paired with its correct label, yi. In many cases, these gold standard training labels must be generated by a process of expert annotation, meaning that each xi must be manually labeled by a trained individual. Even when the annotation task is quite simple for people to carry out (e.g., in the case of spam detection), the number of potential examples that could be classified (representing a subset of X, which may of course be infinite in size) will far exceed the amount of data that can be annotated. As the annotation task becomes more complicated (e.g., when predicting more complex structures such as sequences of labels or when the annotation task requires specialized expertise), annotation becomes far more challenging.

Unsupervised learning, on the other hand, requires only that the training data consist of a representative collection of objects that should be annotated, that is Z = ⟨x1, x2, ...⟩, where xi ∈ X, but without any example annotations. While it may at first seem counterintuitive that meaningful annotations can be learned without any examples of the desired annotations being given, the learning criteria and model structure (which crucially define the space of possible annotations Y and the process by which annotations relate to observable inputs) make it possible to induce annotations by relying on regularities in the unclassified training instances. While a thorough discussion of unsupervised learning is beyond the scope of this book, we focus on a particular class of algorithms, expectation maximization (EM) algorithms, that can be used to learn the parameters of a joint model Pr(x, y) from incomplete data (i.e., data where some of the variables in the model cannot be observed; in the case of unsupervised learning, the yi’s are unobserved). Expectation maximization algorithms fit naturally into the MapReduce paradigm, and are used to solve a number of problems of interest in text processing. Furthermore, these algorithms can be quite computationally expensive, since they generally require repeated evaluations of the training data. MapReduce therefore provides an opportunity not only to scale to larger amounts of data, but also to improve efficiency bottlenecks at scales where non-parallel solutions could be utilized.

This chapter is organized as follows. In Section 6.1, we describe maximum likelihood estimation for statistical models, show how this is generalized to models where not all variables are observable, and then introduce expectation maximization (EM). We describe hidden Markov models (HMMs) in Section 6.2, a very versatile class of models that uses EM for parameter estimation. Section 6.3 discusses how EM algorithms can be expressed in MapReduce, and then in Section 6.4 we look at a case study of word alignment for statistical machine translation. Section 6.5 examines

similar algorithms that are appropriate for supervised learning tasks. This chapter concludes with a summary and pointers to additional readings.

6.1 EXPECTATION MAXIMIZATION

Expectation maximization (EM) algorithms [49] are a family of iterative optimization algorithms for learning probability distributions from incomplete data. They are extensively used in statistical natural language processing where one seeks to infer latent linguistic structure from unannotated text. To name just a few applications, EM algorithms have been used to find part-of-speech sequences, constituency and dependency trees, alignments between texts in different languages, alignments between acoustic signals and their transcriptions, as well as for numerous other clustering and structure discovery problems.

Expectation maximization generalizes the principle of maximum likelihood estimation to the case where the values of some variables are unobserved (specifically, those characterizing the latent structure that is sought).

6.1.1 MAXIMUM LIKELIHOOD ESTIMATION

Maximum likelihood estimation (MLE) is a criterion for fitting the parameters $\theta$ of a statistical model to some given data $\mathbf{x}$. Specifically, it says to select the parameter settings $\theta^*$ such that the likelihood of observing the training data given the model is maximized:

$$\theta^* = \arg\max_{\theta} \Pr(X = \mathbf{x}; \theta) \qquad (6.1)$$

To illustrate, consider the simple marble game shown in Figure 6.1. In this game, a marble is released at the position indicated by the black dot, and it bounces down into one of the cups at the bottom of the board, being diverted to the left or right by the peg (indicated by a triangle) in the center. Our task is to construct a model that predicts which cup the ball will drop into. A "rule-based" approach might be to take exact measurements of the board and construct a physical model that we can use to predict the behavior of the ball. Given sophisticated enough measurements, this could certainly lead to a very accurate model. However, the construction of this model would be quite time consuming and difficult.

A statistical approach, on the other hand, might be to assume that the behavior of the marble in this game can be modeled using a Bernoulli random variable $Y$ with parameter $p$. That is, the value of the random variable indicates whether path 0 or 1 is taken. We also define a random variable $X$ whose value is the label of the cup that the marble ends up in; note that $X$ is deterministically related to $Y$, so an observation of $X$ is equivalent to an observation of $Y$.

To estimate the parameter $p$ of the statistical model of our game, we need some training data, so we drop 10 marbles into the game which end up in cups $\mathbf{x} = \langle b, b, b, a, b, b, b, b, b, a \rangle$.
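As a numerical companion to this sample, the following sketch evaluates the Bernoulli log-likelihood and compares a closed-form relative-frequency estimate with a coarse grid search. It assumes that $p$ is the probability of the marble going right into cup b; the function name `log_likelihood` and the grid step of 0.01 are choices made here for illustration only.

```python
import math

# The ten observed cups from the sample above (eight b's and two a's).
x = list("bbbabbbbba")


def log_likelihood(p, data):
    """Log-likelihood of i.i.d. drops, where p = Pr(marble ends in cup b)."""
    return sum(math.log(p if cup == "b" else 1.0 - p) for cup in data)


# Closed-form estimate: the relative frequency of cup b.
p_closed = x.count("b") / len(x)

# Independent check: pick the best p on a coarse grid (step size is arbitrary).
grid = [i / 100 for i in range(1, 100)]
p_grid = max(grid, key=lambda p: log_likelihood(p, x))

print(p_closed, p_grid)
```

Both routes agree on this sample, which is what one would expect from a concave log-likelihood with a single maximum.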

Figure 6.1: A simple marble game where a released marble takes one of two possible paths. This game can be modeled using a Bernoulli random variable with parameter $p$, which indicates the probability that the marble will go to the right when it hits the peg.

What is the maximum likelihood estimate of $p$ given this data? By assuming that our samples are independent and identically distributed (i.i.d.), we can write the likelihood of our data as follows:³

$$\Pr(\mathbf{x}; p) = \prod_{j=1}^{10} p^{\delta(x_j, b)} (1 - p)^{\delta(x_j, a)} = p^8 \cdot (1 - p)^2$$

³ In this equation, $\delta$ is the Kronecker delta function which evaluates to 1 where its arguments are equal and 0 otherwise.

Since log is a monotonically increasing function, maximizing $\log \Pr(\mathbf{x}; p)$ will give us the desired result. We can do this by differentiating with respect to $p$ and finding where the resulting expression equals 0:

$$\frac{d \log \Pr(\mathbf{x}; p)}{dp} = 0$$
$$\frac{d[8 \cdot \log p + 2 \cdot \log(1 - p)]}{dp} = 0$$
$$\frac{8}{p} - \frac{2}{1 - p} = 0$$

Solving for $p$ yields 0.8, which is the intuitive result: eight of the ten marbles went right, into cup b. Furthermore, it is straightforward to show that in $N$ trials where $N_0$ marbles followed path 0 to cup a, and $N_1$ marbles followed path 1 to cup b, the maximum likelihood estimate of $p$ is $N_1/(N_0 + N_1)$.

While this model only makes use of an approximation of the true physical process at work when the marble interacts with the game board, it is an empirical question whether the model works well enough in practice to be useful. Additionally, while a Bernoulli trial is an extreme approximation of the physical process, if insufficient resources were invested in building a physical model,

the approximation may perform better than the more complicated "rule-based" model. This sort of dynamic is found often in text processing problems: given enough data, astonishingly simple models can outperform complex knowledge-intensive models that attempt to simulate complicated processes.

Figure 6.2: A more complicated marble game where the released marble takes one of four possible paths. We assume that we can only observe which cup the marble ends up in, not the specific path taken.

6.1.2 A LATENT VARIABLE MARBLE GAME

To see where latent variables might come into play in modeling, consider a more complicated variant of our marble game shown in Figure 6.2. This version consists of three pegs that influence the marble's path, and the marble may end up in one of three cups. Note that both paths 1 and 2 lead to cup b.

To construct a statistical model of this game, we again assume that the behavior of a marble interacting with a peg can be modeled with a Bernoulli random variable. Since there are three pegs, we have three random variables with parameters $\theta = \langle p_0, p_1, p_2 \rangle$, corresponding to the probabilities that the marble will go to the right at the top, left, and right pegs. We further define a random variable $X$ taking on values from $\{a, b, c\}$ indicating what cup the marble ends in, and $Y$, taking on values from $\{0, 1, 2, 3\}$ indicating which path was taken. Note that the full joint distribution $\Pr(X = x, Y = y)$ is determined by $\theta$.

How should the parameters $\theta$ be estimated? If it were possible to observe the paths taken by marbles as they were dropped into the game, it would be trivial to estimate the parameters for our model using the maximum likelihood estimator: we would simply need to count the number of times the marble bounced left or right at each peg. If $N_x$ counts the number of times a marble took path $x$ in $N$ trials, this is:

$$p_0 = \frac{N_2 + N_3}{N} \qquad p_1 = \frac{N_1}{N_0 + N_1} \qquad p_2 = \frac{N_3}{N_2 + N_3}$$
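The counting estimator for the fully observed three-peg game can be sketched in a few lines. The function name `mle_from_paths` and the list of ten observed paths are made up for illustration; the sketch also assumes that at least one marble reaches each lower peg, since otherwise the corresponding ratio is undefined.

```python
from collections import Counter


def mle_from_paths(paths):
    """Estimate (p0, p1, p2) from observed path labels 0..3 by counting.

    Assumes at least one marble reaches each lower peg; otherwise the
    corresponding ratio divides by zero.
    """
    n = Counter(paths)
    total = len(paths)
    p0 = (n[2] + n[3]) / total   # went right at the top peg
    p1 = n[1] / (n[0] + n[1])    # went right at the left peg
    p2 = n[3] / (n[2] + n[3])    # went right at the right peg
    return p0, p1, p2


# Hypothetical observed paths for ten marbles (invented for this example).
paths = [0, 1, 1, 2, 3, 3, 0, 1, 2, 3]
print(mle_from_paths(paths))
```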

However, we wish to consider the case where the paths taken are unobservable (imagine an opaque sheet covering the center of the game board), but where we can see what cup a marble ends in. In other words, we want to consider the case where we have partial data. This is exactly the problem encountered in unsupervised learning: there is a statistical model describing the relationship between two sets of variables ($X$'s and $Y$'s), and there is data available from just one of them. Furthermore, such algorithms are quite useful in text processing, where latent variables may describe latent linguistic structures of the observed variables, such as parse trees or part-of-speech tags, or alignment structures relating sets of observed variables (see Section 6.4).

6.1.3 MLE WITH LATENT VARIABLES

Formally, we consider the problem of estimating parameters for statistical models of the form $\Pr(X, Y; \theta)$ which describe not only an observable variable $X$ but a latent, or hidden, variable $Y$.

In these models, since only the values of the random variable $X$ are observable, we define our optimization criterion to be the maximization of the marginal likelihood, that is, summing over all settings of the latent variable $Y$, which takes on values from a set designated $\mathcal{Y}$.⁴ Again, we assume that samples in the training data $\mathbf{x}$ are i.i.d.:

$$\Pr(X = x) = \sum_{y \in \mathcal{Y}} \Pr(X = x, Y = y; \theta)$$

For a vector of training observations $\mathbf{x} = \langle x_1, x_2, \ldots, x_\ell \rangle$, if we assume the samples are i.i.d.:

$$\Pr(\mathbf{x}; \theta) = \prod_{j=1}^{|\mathbf{x}|} \sum_{y \in \mathcal{Y}} \Pr(X = x_j, Y = y; \theta)$$

Thus, the maximum (marginal) likelihood estimate of the model parameters $\theta^*$ given a vector of i.i.d. observations $\mathbf{x}$ becomes:

$$\theta^* = \arg\max_{\theta} \prod_{j=1}^{|\mathbf{x}|} \sum_{y \in \mathcal{Y}} \Pr(X = x_j, Y = y; \theta)$$

Unfortunately, in many cases, this maximum cannot be computed analytically, but the iterative hill-climbing approach of expectation maximization can be used instead.

⁴ For this description, we assume that the variables in our model take on discrete values. Not only does this simplify exposition, but discrete models are widely used in text processing.
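The marginal likelihood is easy to evaluate directly when the latent variable has few values. The sketch below does this for the three-peg marble game of Section 6.1.2, summing the joint probability over the four paths for each observed cup. The helper names `joint` and `marginal_log_likelihood` and the sample data are assumptions introduced here for illustration.

```python
import math

# Which cup each path ends in, following the game description.
CUP_OF_PATH = {0: "a", 1: "b", 2: "b", 3: "c"}


def joint(x, y, theta):
    """Pr(X = x, Y = y; theta); zero when path y does not end in cup x."""
    p0, p1, p2 = theta
    path_prob = {
        0: (1 - p0) * (1 - p1),
        1: (1 - p0) * p1,
        2: p0 * (1 - p2),
        3: p0 * p2,
    }
    return path_prob[y] if CUP_OF_PATH[y] == x else 0.0


def marginal_log_likelihood(data, theta):
    """log of prod_j sum_y Pr(X = x_j, Y = y; theta)."""
    return sum(
        math.log(sum(joint(x, y, theta) for y in range(4))) for x in data
    )


data = ["a", "b", "b", "c"]          # hypothetical observed cups
print(marginal_log_likelihood(data, (0.5, 0.5, 0.5)))
```

Evaluating this quantity is straightforward; the difficulty the text points to lies in maximizing it over $\theta$, because the sum sits inside the logarithm.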

6.1.4 EXPECTATION MAXIMIZATION

Expectation maximization (EM) is an iterative algorithm that finds a successive series of parameter estimates $\theta^{(0)}, \theta^{(1)}, \ldots$ that improve the marginal likelihood of the training data. That is, EM guarantees:

$$\prod_{j=1}^{|\mathbf{x}|} \sum_{y \in \mathcal{Y}} \Pr(X = x_j, Y = y; \theta^{(i+1)}) \geq \prod_{j=1}^{|\mathbf{x}|} \sum_{y \in \mathcal{Y}} \Pr(X = x_j, Y = y; \theta^{(i)})$$

The algorithm starts with some initial set of parameters $\theta^{(0)}$ and then updates them using two steps: expectation (E-step), which computes the posterior distribution over the latent variables given the observable data $\mathbf{x}$ and a set of parameters $\theta^{(i)}$,⁵ and maximization (M-step), which computes new parameters $\theta^{(i+1)}$ maximizing the expected log likelihood of the joint distribution with respect to the distribution computed in the E-step. The process then repeats with these new parameters. The algorithm terminates when the likelihood remains unchanged.⁶ In more detail, the steps are as follows:

E-step. Compute the posterior probability of each possible hidden variable assignment $y \in \mathcal{Y}$ for each $x \in \mathcal{X}$ and the current parameter settings, weighted by the relative frequency with which $x$ occurs in $\mathbf{x}$. Call this $q(X = x, Y = y; \theta^{(i)})$ and note that it defines a joint probability distribution over $\mathcal{X} \times \mathcal{Y}$ in that $\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} q(x, y) = 1$.

$$q(x, y; \theta^{(i)}) = f(x|\mathbf{x}) \cdot \Pr(Y = y | X = x; \theta^{(i)}) = f(x|\mathbf{x}) \cdot \frac{\Pr(x, y; \theta^{(i)})}{\sum_{y'} \Pr(x, y'; \theta^{(i)})}$$

M-step. Compute new parameter settings that maximize the expected log of the probability of the joint distribution under the q-distribution that was computed in the E-step:

$$\theta^{(i+1)} = \arg\max_{\theta'} \mathbb{E}_{q(X=x, Y=y; \theta^{(i)})} \log \Pr(X = x, Y = y; \theta')$$
$$= \arg\max_{\theta'} \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} q(X = x, Y = y; \theta^{(i)}) \cdot \log \Pr(X = x, Y = y; \theta')$$

We omit the proof that the model with parameters $\theta^{(i+1)}$ will have equal or greater marginal likelihood on the training data than the model with parameters $\theta^{(i)}$, but this is provably true [78].

⁵ The term 'expectation' is used since the values computed in terms of the posterior distribution $\Pr(y|x; \theta^{(i)})$ that are required to solve the M-step have the form of an expectation (with respect to this distribution).
⁶ The final solution is only guaranteed to be a local maximum, but if the model is fully convex, it will also be the global maximum.

Before continuing, we note that the effective application of expectation maximization requires that both the E-step and the M-step consist of tractable computations. Specifically, summing

over the space of hidden variable assignments must not be intractable. Depending on the independence assumptions made in the model, this may be achieved through techniques such as dynamic programming. However, some models may require intractable computations.

6.1.5 AN EM EXAMPLE

Let's look at how to estimate the parameters from our latent variable marble game from Section 6.1.2 using EM. We assume training data $\mathbf{x}$ consisting of $N = |\mathbf{x}|$ observations of $X$ with $N_a$, $N_b$, and $N_c$ indicating the number of marbles ending in cups a, b, and c. We start with some parameters $\theta^{(0)} = \langle p_0^{(0)}, p_1^{(0)}, p_2^{(0)} \rangle$ that have been randomly initialized to values between 0 and 1.

E-step. We need to compute the distribution $q(X = x, Y = y; \theta^{(i)})$, as defined above. We first note that the relative frequency $f(x|\mathbf{x})$ is:

$$f(x|\mathbf{x}) = \frac{N_x}{N}$$

Next, we observe that $\Pr(Y = 0 | X = a) = 1$ and $\Pr(Y = 3 | X = c) = 1$ since cups a and c fully determine the value of the path variable $Y$. The posterior probabilities of paths 1 and 2 are only non-zero when $X$ is b:

$$\Pr(1|b; \theta^{(i)}) = \frac{(1 - p_0^{(i)}) p_1^{(i)}}{(1 - p_0^{(i)}) p_1^{(i)} + p_0^{(i)} (1 - p_2^{(i)})} \qquad \Pr(2|b; \theta^{(i)}) = \frac{p_0^{(i)} (1 - p_2^{(i)})}{(1 - p_0^{(i)}) p_1^{(i)} + p_0^{(i)} (1 - p_2^{(i)})}$$

Except for the four cases just described, $\Pr(Y = y | X = x)$ is zero for all other values of $x$ and $y$ (regardless of the value of the parameters).

M-step. We now need to maximize the expectation of $\log \Pr(X, Y; \theta')$ (which will be a function in terms of the three parameter variables) under the q-distribution we computed in the E-step. The non-zero terms in the expectation are as follows:

$$\begin{array}{cccc}
x & y & q(X = x, Y = y; \theta^{(i)}) & \log \Pr(X = x, Y = y; \theta') \\
a & 0 & N_a/N & \log(1 - p_0) + \log(1 - p_1) \\
b & 1 & N_b/N \cdot \Pr(1|b; \theta^{(i)}) & \log(1 - p_0) + \log p_1 \\
b & 2 & N_b/N \cdot \Pr(2|b; \theta^{(i)}) & \log p_0 + \log(1 - p_2) \\
c & 3 & N_c/N & \log p_0 + \log p_2
\end{array}$$

Multiplying across each row and adding from top to bottom yields the expectation we wish to maximize. Each parameter can be optimized independently using differentiation. The resulting optimal values are expressed in terms of the counts in $\mathbf{x}$ and $\theta^{(i)}$:

$$p_0 = \frac{\Pr(2|b; \theta^{(i)}) \cdot N_b + N_c}{N} \qquad p_1 = \frac{\Pr(1|b; \theta^{(i)}) \cdot N_b}{N_a + \Pr(1|b; \theta^{(i)}) \cdot N_b} \qquad p_2 = \frac{N_c}{\Pr(2|b; \theta^{(i)}) \cdot N_b + N_c}$$
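The E-step and M-step above translate almost line for line into code. The following is a minimal sketch, not a general EM implementation: the function names `em_step` and `cup_probs`, the cup counts, and the starting parameters are all assumptions chosen for illustration.

```python
def em_step(counts, theta):
    """One EM iteration for the three-peg marble game.

    counts: (Na, Nb, Nc) observed cup counts; theta: (p0, p1, p2).
    """
    na, nb, nc = counts
    p0, p1, p2 = theta
    n = na + nb + nc
    # E-step: posterior over the two hidden paths that end in cup b.
    w1 = (1 - p0) * p1
    w2 = p0 * (1 - p2)
    post1 = w1 / (w1 + w2)
    post2 = w2 / (w1 + w2)
    # M-step: closed-form updates built from expected path counts.
    new_p0 = (post2 * nb + nc) / n
    new_p1 = post1 * nb / (na + post1 * nb)
    new_p2 = nc / (post2 * nb + nc)
    return new_p0, new_p1, new_p2


def cup_probs(theta):
    """Marginal probabilities of cups a, b, c under theta."""
    p0, p1, p2 = theta
    return ((1 - p0) * (1 - p1), (1 - p0) * p1 + p0 * (1 - p2), p0 * p2)


counts = (2, 5, 3)            # hypothetical Na, Nb, Nc
theta0 = (0.3, 0.6, 0.8)      # arbitrary starting point
theta1 = em_step(counts, theta0)
print(theta1, cup_probs(theta1))
```

With these inputs, the cup probabilities implied by `theta1` equal the observed relative frequencies (0.2, 0.5, 0.3), so a second call to `em_step` leaves them unchanged. Different starting points give different values of `theta1` with the same cup probabilities, since the split of cup b between paths 1 and 2 is not determined by the data.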

It is worth noting that the form of these expressions is quite similar to the fully observed maximum likelihood estimate. However, rather than depending on exact path counts, the statistics used are the expected path counts, given $\mathbf{x}$ and parameters $\theta^{(i)}$.

Typically, the values computed at the end of the M-step would serve as new parameters for another iteration of EM. However, the example we have presented here is quite simple and the model converges to a global optimum after a single iteration. For most models, EM requires several iterations to converge, and it may not find a global optimum. And since EM only finds a locally optimal solution, the final parameter values depend on the values chosen for $\theta^{(0)}$.

6.2 HIDDEN MARKOV MODELS

To give a more substantial and useful example of models whose parameters may be estimated using EM, we turn to hidden Markov models (HMMs). HMMs are models of data that are ordered sequentially (temporally, from left to right, etc.), such as words in a sentence, base pairs in a gene, or letters in a word. These simple but powerful models have been used in applications as diverse as speech recognition [78], information extraction [139], gene finding [143], part-of-speech tagging [44], stock market forecasting [70], text retrieval [108], and word alignment of parallel (translated) texts [150] (more in Section 6.4).

In an HMM, the data being modeled is posited to have been generated from an underlying Markov process, which is a stochastic process consisting of a finite set of states where the probability of entering a state at time $t + 1$ depends only on the state of the process at time $t$ [130]. Alternatively, one can view a Markov process as a probabilistic variant of a finite state machine, where transitions are taken probabilistically. As another point of comparison, the PageRank algorithm considered in the previous chapter (Section 5.3) can be understood as a Markov process: the probability of following any link on a particular page is independent of the path taken to reach that page. The states of this Markov process are, however, not directly observable (i.e., hidden). Instead, at each time step, an observable token (e.g., a word, base pair, or letter) is emitted according to a probability distribution conditioned on the identity of the state that the underlying process is in.

A hidden Markov model $M$ is defined as a tuple $\langle S, O, \theta \rangle$. $S$ is a finite set of states, which generate symbols from a finite observation vocabulary $O$. Following convention, we assume that variables $q$, $r$, and $s$ refer to states in $S$, and $o$ refers to symbols in the observation vocabulary $O$. This model is parameterized by the tuple $\theta = \langle A, B, \pi \rangle$ consisting of an $|S| \times |S|$ matrix $A$ of transition probabilities, where $A_q(r)$ gives the probability of transitioning from state $q$ to state $r$; an $|S| \times |O|$ matrix $B$ of emission probabilities, where $B_q(o)$ gives the probability that symbol $o$ will be emitted from state $q$; and an $|S|$-dimensional vector $\pi$, where $\pi_q$ is the probability that the process starts in state $q$.⁷

⁷ This is only one possible definition of an HMM, but it is one that is useful for many text processing problems. In alternative definitions, initial and final states may be handled differently, observations may be emitted during the transition between states, or continuous-valued observations may be emitted (for example, from a Gaussian distribution).

These matrices may be dense, but for many applications sparse parameterizations

are useful. We further stipulate that $A_q(r) \geq 0$, $B_q(o) \geq 0$, and $\pi_q \geq 0$ for all $q$, $r$, and $o$, as well as that:

$$\sum_{r \in S} A_q(r) = 1 \;\; \forall q \qquad \sum_{o \in O} B_q(o) = 1 \;\; \forall q \qquad \sum_{q \in S} \pi_q = 1$$

A sequence of observations of length $\tau$ is generated as follows:

Step 0: let $t = 1$ and select an initial state $q$ according to the distribution $\pi$.
Step 1: an observation symbol from $O$ is emitted according to the distribution $B_q$.
Step 2: a new $q$ is drawn according to the distribution $A_q$.
Step 3: $t$ is incremented, and if $t \leq \tau$, the process repeats from Step 1.

Since all events generated by this process are conditionally independent, the joint probability of this sequence of observations and the state sequence used to generate it is the product of the individual event probabilities.

Figure 6.3 shows a simple example of a hidden Markov model for part-of-speech tagging, which is the task of assigning to each word in an input sentence its grammatical category (one of the first steps in analyzing textual content). States $S = \{\text{det}, \text{adj}, \text{nn}, \text{v}\}$ correspond to the parts of speech (determiner, adjective, noun, and verb), and observations $O = \{\text{the}, \text{a}, \text{green}, \ldots\}$ are a subset of English words. This example illustrates a key intuition behind many applications of HMMs: states correspond to equivalence classes or clustering of observations, and a single observation type may be associated with several clusters (in this example, the word wash can be generated by an nn or v, since wash can either be a noun or a verb).

6.2.1 THREE QUESTIONS FOR HIDDEN MARKOV MODELS

There are three fundamental questions associated with hidden Markov models:⁸

1. Given a model $M = \langle S, O, \theta \rangle$, and an observation sequence of symbols from $O$, $\mathbf{x} = \langle x_1, x_2, \ldots, x_\tau \rangle$, what is the probability that $M$ generated the data (summing over all possible state sequences, $\mathcal{Y}$)?

$$\Pr(\mathbf{x}) = \sum_{\mathbf{y} \in \mathcal{Y}} \Pr(\mathbf{x}, \mathbf{y}; \theta)$$

2. Given a model $M = \langle S, O, \theta \rangle$ and an observation sequence $\mathbf{x}$, what is the most likely sequence of states that generated the data?

$$\mathbf{y}^* = \arg\max_{\mathbf{y} \in \mathcal{Y}} \Pr(\mathbf{x}, \mathbf{y}; \theta)$$

⁸ The organization of this section is based in part on ideas from Lawrence Rabiner's HMM tutorial [125].

Figure 6.3: An example HMM that relates part-of-speech tags to vocabulary items in an English-like language. Possible (probability > 0) transitions for the Markov process are shown graphically. In the example outputs, the state sequences corresponding to the emissions are written beneath the emitted symbols. (Panels: initial probabilities, transition probabilities, and emission probabilities for the states DET, ADJ, NN, and V.)

Example outputs:
John might wash (NN V V)
the big green person loves old plants (DET ADJ ADJ NN V ADJ NN)
plants washes books books books (NN V V NN V)

3. Given a set of states $S$, an observation vocabulary $O$, and a series of $\ell$ i.i.d. observation sequences $\langle \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_\ell \rangle$, what are the parameters $\theta = \langle A, B, \pi \rangle$ that maximize the likelihood of the training data?

$$\theta^* = \arg\max_{\theta} \prod_{i=1}^{\ell} \sum_{\mathbf{y} \in \mathcal{Y}} \Pr(\mathbf{x}_i, \mathbf{y}; \theta)$$

Using our definition of an HMM, the answers to the first two questions are in principle quite trivial to compute: by iterating over all state sequences $\mathcal{Y}$, the probability that each generated $\mathbf{x}$ can be computed by looking up and multiplying the relevant probabilities in $A$, $B$, and $\pi$, and then summing the result or taking the maximum. And, as we hinted at in the previous section, the third question can be answered using EM. Unfortunately, even with all the distributed computing power MapReduce makes available, we will quickly run into trouble if we try to use this naïve strategy since there are $|S|^\tau$ distinct state sequences of length $\tau$, making exhaustive enumeration computationally intractable. Fortunately, because the underlying model behaves exactly the same whenever it is in some state, regardless of how it got to that state, we can use dynamic programming algorithms to answer all of the above questions without summing over exponentially many sequences.

6.2.2 THE FORWARD ALGORITHM

Given some observation sequence, for example $\mathbf{x} = \langle \text{John}, \text{might}, \text{wash} \rangle$, Question 1 asks what is the probability that this sequence was generated by an HMM $M = \langle S, O, \theta \rangle$. For the purposes of illustration, we assume that $M$ is defined as shown in Figure 6.3.

There are two ways to compute the probability of $\mathbf{x}$ having been generated by $M$. The first is to compute the sum over the joint probability of $\mathbf{x}$ and every possible labeling $\mathbf{y}' \in \{\langle \text{det}, \text{det}, \text{det} \rangle, \langle \text{det}, \text{det}, \text{nn} \rangle, \langle \text{det}, \text{det}, \text{v} \rangle, \ldots\}$. As indicated above, this is not feasible for most sequences, since the set of possible labels is exponential in the length of $\mathbf{x}$. The second, fortunately, is much more efficient.

We can make use of what is known as the forward algorithm to compute the desired probability in polynomial time. We assume a model $M = \langle S, O, \theta \rangle$ as defined above. This algorithm works by recursively computing the answer to a related question: what is the probability that the process is in state $q$ at time $t$ and has generated $\langle x_1, x_2, \ldots, x_t \rangle$? Call this probability $\alpha_t(q)$. Thus, $\alpha_t(q)$ is a two dimensional matrix (of size $|\mathbf{x}| \times |S|$), called a trellis. It is easy to see that the values of $\alpha_1(q)$ can be computed as the product of two independent probabilities: the probability of starting in state $q$ and the probability of state $q$ generating $x_1$:

$$\alpha_1(q) = \pi_q \cdot B_q(x_1)$$

From this, it's not hard to see that the values of $\alpha_2(r)$ for every $r$ can be computed in terms of the $|S|$ values in $\alpha_1(\cdot)$ and the observation $x_2$:

$$\alpha_2(r) = B_r(x_2) \cdot \sum_{q \in S} \alpha_1(q) \cdot A_q(r)$$

This works because there are $|S|$ different ways to get to state $r$ at time $t = 2$: starting from state $1, 2, \ldots, |S|$ and transitioning to state $r$. Furthermore, because the behavior of a Markov process is determined only by the state it is in at some time (not by how it got to that state), $\alpha_t(r)$ can always be computed in terms of the $|S|$ values in $\alpha_{t-1}(\cdot)$ and the observation $x_t$:

$$\alpha_t(r) = B_r(x_t) \cdot \sum_{q \in S} \alpha_{t-1}(q) \cdot A_q(r)$$

We have now shown how to compute the probability of being in any state $q$ at any time $t$, having generated $\langle x_1, x_2, \ldots, x_t \rangle$, with the forward algorithm. The probability of the full sequence is the probability of being at time $|\mathbf{x}|$ and in any state, so the answer to Question 1 can be computed simply by summing over $\alpha$ values at time $|\mathbf{x}|$ for all states:

$$\Pr(\mathbf{x}; \theta) = \sum_{q \in S} \alpha_{|\mathbf{x}|}(q)$$

In summary, there are two ways of computing the probability that a sequence of observations $\mathbf{x}$ was generated by $M$: exhaustive enumeration with summing and the forward algorithm. Figure 6.4 illustrates the two possibilities. The upper panel shows the naïve exhaustive approach, enumerating all $4^3$ possible labels $\mathbf{y}'$ of $\mathbf{x}$ and computing their joint probability $\Pr(\mathbf{x}, \mathbf{y}')$. Summing over all $\mathbf{y}'$, the marginal probability of $\mathbf{x}$ is found to be 0.00018. The lower panel shows the forward trellis, consisting of $4 \times 3$ cells. Summing over the final column also yields 0.00018, the same result.

6.2.3 THE VITERBI ALGORITHM

Given an observation sequence $\mathbf{x}$, the second question we might want to ask of $M$ is: what is the most likely sequence of states that generated the observations? As with the previous question, the naïve approach to solving this problem is to enumerate all possible labels and find the one with the highest joint probability. Continuing with the example observation sequence $\mathbf{x} = \langle \text{John}, \text{might}, \text{wash} \rangle$, examining the chart of probabilities in the upper panel of Figure 6.4 shows that $\mathbf{y}^* = \langle \text{nn}, \text{v}, \text{v} \rangle$ is the most likely sequence of states under our example HMM.

However, a more efficient answer to Question 2 can be computed using the same intuition as in the forward algorithm: determine the best state sequence for a short sequence and extend this to easily compute the best sequence for longer ones. This is known as the Viterbi algorithm. We define $\gamma_t(q)$, the Viterbi probability, to be the probability of the most probable sequence of states ending in state $q$ at time $t$ and generating observations $\langle x_1, x_2, \ldots, x_t \rangle$. Since we wish to be able to reconstruct the sequence

Figure 6.4: Computing the probability of the sequence $\langle \text{John}, \text{might}, \text{wash} \rangle$ under the HMM given in Figure 6.3 by explicitly summing over all possible sequence labels (upper panel) and using the forward algorithm (lower panel). Upper panel: the joint probability $p(\mathbf{x}, \mathbf{y})$ of each of the 64 labelings over DET, ADJ, NN, and V, with $\Pr(\mathbf{x}) = \sum_{\mathbf{y} \in \mathcal{Y}} \Pr(\mathbf{x}, \mathbf{y}; \theta) = 0.00018$. Lower panel: the forward trellis with columns $\alpha_1$, $\alpha_2$, $\alpha_3$ for John, might, wash and rows DET, ADJ, NN, V, with $\Pr(\mathbf{x}) = \sum_{q \in S} \alpha_3(q) = 0.00018$.
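The two computations compared in Figure 6.4 can be sketched side by side. The model below is a hypothetical two-state HMM whose probabilities are invented for illustration; it is not the model of Figure 6.3 and will not reproduce the figure's numbers. The function names `joint`, `brute_force`, and `forward` are likewise assumptions. The point is only that exhaustive enumeration and the forward recursion return the same marginal probability.

```python
from itertools import product

# Hypothetical two-state HMM; every number is an illustrative assumption
# and is NOT taken from Figure 6.3.
states = ["nn", "v"]
pi = {"nn": 0.7, "v": 0.3}
A = {"nn": {"nn": 0.4, "v": 0.6}, "v": {"nn": 0.5, "v": 0.5}}
B = {
    "nn": {"John": 0.5, "might": 0.1, "wash": 0.4},
    "v": {"John": 0.0, "might": 0.6, "wash": 0.4},
}


def joint(x, y):
    """Pr(x, y): product of start, emission, and transition probabilities."""
    p = pi[y[0]] * B[y[0]][x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1]][y[t]] * B[y[t]][x[t]]
    return p


def brute_force(x):
    """Sum Pr(x, y) over all |S|**len(x) state sequences."""
    return sum(joint(x, y) for y in product(states, repeat=len(x)))


def forward(x):
    """Forward algorithm: returns Pr(x) and the trellis alpha[t][q]."""
    alpha = [{q: pi[q] * B[q][x[0]] for q in states}]
    for t in range(1, len(x)):
        prev = alpha[-1]
        alpha.append(
            {r: B[r][x[t]] * sum(prev[q] * A[q][r] for q in states)
             for r in states}
        )
    return sum(alpha[-1].values()), alpha


x = ["John", "might", "wash"]
print(brute_force(x), forward(x)[0])
```

The enumeration touches $|S|^{|\mathbf{x}|}$ sequences, while the trellis needs on the order of $|\mathbf{x}| \cdot |S|^2$ multiplications, which is why only the second approach scales to realistic sequence lengths.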

of states, we define $bp_t(q)$, the "backpointer", to be the state used in this sequence at time $t - 1$. The base case for the recursion is as follows (the state index of $-1$ is used as a placeholder since there is no previous best state at time $t = 1$):

$$\gamma_1(q) = \pi_q \cdot B_q(x_1) \qquad bp_1(q) = -1$$

The recursion is similar to that of the forward algorithm, except rather than summing over previous states, the maximum value of all possible trajectories into state $r$ at time $t$ is computed. Note that the backpointer simply records the index of the originating state; a separate computation is not necessary.

$$\gamma_t(r) = \max_{q \in S} \gamma_{t-1}(q) \cdot A_q(r) \cdot B_r(x_t)$$
$$bp_t(r) = \arg\max_{q \in S} \gamma_{t-1}(q) \cdot A_q(r) \cdot B_r(x_t)$$

To compute the best sequence of states, $\mathbf{y}^*$, the state with the highest probability path at time $|\mathbf{x}|$ is selected, and then the backpointers are followed, recursively, to construct the rest of the sequence:

$$y^*_{|\mathbf{x}|} = \arg\max_{q \in S} \gamma_{|\mathbf{x}|}(q) \qquad y^*_{t-1} = bp_t(y^*_t)$$

Figure 6.5 illustrates a Viterbi trellis, including backpointers that have been used to compute the most likely state sequence.

6.2.4 PARAMETER ESTIMATION FOR HMMS

We now turn to Question 3: given a set of states $S$ and observation vocabulary $O$, what are the parameters $\theta^* = \langle A, B, \pi \rangle$ that maximize the likelihood of a set of training examples, $\langle \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_\ell \rangle$?⁹ Since our model is constructed in terms of variables whose values we cannot observe (the state sequence) in the training data, we may train it to optimize the marginal likelihood (summing over all state sequences) of $\mathbf{x}$ using EM. Deriving the EM update equations requires only the application of the techniques presented earlier in this chapter and some differential calculus. However, since the formalism is cumbersome, we will skip a detailed derivation, but readers interested in more information can find it in the relevant citations [78, 125].

⁹ Since an HMM models sequences, training data consist of a collection of example sequences.

In order to make the update equations as intuitive as possible, consider a fully observable HMM, that is, one where both the emissions and the state sequence are observable in all training instances. In this case, a training instance can be depicted as shown in Figure 6.6. When this is the

Figure 6.5: Computing the most likely state sequence that generated $\langle \text{John}, \text{might}, \text{wash} \rangle$ under the HMM given in Figure 6.3 using the Viterbi algorithm. The most likely state sequence is highlighted in bold and could be recovered programmatically by following backpointers from the maximal probability cell in the last column to the first column (thicker arrows).

Figure 6.6: A "fully observable" HMM training instance. The output sequence is at the top of the figure, and the corresponding states and transitions are shown in the trellis below.

case, such as when we have a corpus of sentences in which all words have already been tagged with their parts of speech, the maximum likelihood estimate for the parameters can be computed in terms of the counts of the number of times the process transitions from state $q$ to state $r$ in all training instances, $T(q \to r)$; the number of times that state $q$ emits symbol $o$, $O(q \uparrow o)$; and the number of times the process starts in state $q$, $I(q)$. In this example, the process starts in state nn; there is one nn $\to$ v transition and one v $\to$ v transition. The nn state emits John in the first time step, and the v state emits might and wash in the second and third time steps, respectively. We also define $N(q)$ to be the number of times the process enters state $q$. The maximum likelihood estimates of the parameters in the fully observable case are:

\[
\pi_q = \frac{I(q)}{\sum_{r} I(r)}
\qquad
A_q(r) = \frac{T(q \to r)}{N(q)}, \quad N(q) = \sum_{r'} T(q \to r')
\qquad
B_q(o) = \frac{O(q \uparrow o)}{N(q)}, \quad N(q) = \sum_{o'} O(q \uparrow o')
\tag{6.2}
\]

For example, to compute the emission parameters from state nn, we simply need to keep track of the number of times the process is in state nn and what symbol it generated at each of these times. Transition probabilities are computed similarly: to compute, for example, the distribution $A_{det}(\cdot)$, that is, the probabilities of transitioning away from state det, we count the number of times the process is in state det, and keep track of what state the process transitioned into at the next time step. This counting and normalizing can be accomplished using the exact same counting and relative frequency algorithms that we described in Section 3.3. Thus, in the fully observable case, parameter estimation is not a new algorithm at all, but one we have seen before.

How should the model parameters be estimated when the state sequence is not provided? It turns out that the update equations have the satisfying form where the optimal parameter values for iteration $i + 1$ are expressed in terms of the expectations of the counts referenced in the fully observed case, according to the posterior distribution over the latent variables given the observations $x$ and the parameters $\theta^{(i)}$:

\[
\pi_q = E[I(q)]
\qquad
A_q(r) = \frac{E[T(q \to r)]}{E[N(q)]}
\qquad
B_q(o) = \frac{E[O(q \uparrow o)]}{E[N(q)]}
\tag{6.3}
\]

Because of the independence assumptions made in the HMM, the update equations consist of $2 \cdot |S| + 1$ independent optimization problems, just as was the case with the 'observable' HMM. Solving for the initial state distribution, $\pi$, is one problem; there are $|S|$ solving for the transition distributions $A_q(\cdot)$ from each state $q$, and $|S|$ solving for the emission distributions $B_q(\cdot)$ from each state $q$. Furthermore, we note that the following must hold:

\[
E[N(q)] = \sum_{r \in S} E[T(q \to r)] = \sum_{o \in O} E[O(q \uparrow o)]
\]

As a result, the optimization problems (i.e., Equations 6.2) require completely independent sets of statistics, which we will utilize later to facilitate efficient parallelization in MapReduce.

How can the expectations in Equation 6.3 be understood? In the fully observed training case, between every time step, there is exactly one transition taken and the source and destination states are observable. By progressing through the Markov chain, we can let each transition count as '1', and we can accumulate the total number of times each kind of transition was taken (by each kind, we simply mean the number of times that one state follows another, for example, the number of times nn follows det). These statistics can then in turn be used to compute the MLE for an 'observable' HMM, as described above. However, when the transition sequence is not observable (as is most often the case), we can instead imagine that at each time step, every possible transition (there are $|S|^2$ of them, and typically $|S|$ is quite small) is taken, with a particular probability. The probability used is the posterior probability of the transition, given the model and an observation sequence (we describe how to compute this value below).
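The counting and normalizing of Equation 6.2 for the fully observable case can be sketched in a few lines of Python. This is a minimal illustration rather than the book's implementation: the function name `mle_from_tagged` and the input layout (each sentence as a list of word/state pairs) are assumptions made for the example, and sentences are assumed to be non-empty.

```python
from collections import Counter, defaultdict


def mle_from_tagged(corpus):
    """Estimate (pi, A, B) by counting and normalizing, as in Equation 6.2.

    corpus: list of sentences, each a non-empty list of (word, state) pairs.
    """
    initial = Counter()                 # I(q): times the process starts in q
    transitions = defaultdict(Counter)  # T(q -> r)
    emissions = defaultdict(Counter)    # O(q ^ o)
    for sentence in corpus:
        states = [state for _, state in sentence]
        initial[states[0]] += 1
        for word, state in sentence:
            emissions[state][word] += 1
        for q, r in zip(states, states[1:]):
            transitions[q][r] += 1

    def normalize(counts):
        # Relative frequency: divide each count by the sum of its group.
        total = sum(counts.values())
        return {event: c / total for event, c in counts.items()}

    pi = normalize(initial)
    A = {q: normalize(c) for q, c in transitions.items()}
    B = {q: normalize(c) for q, c in emissions.items()}
    return pi, A, B


# The single training instance of Figure 6.6.
corpus = [[("John", "nn"), ("might", "v"), ("wash", "v")]]
pi, A, B = mle_from_tagged(corpus)
```

On the instance from Figure 6.6, nn is the only start state, both observed transitions (nn → v and v → v) receive probability 1 within their own distributions, and v splits its emission mass evenly between might and wash.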

By summing over all the time steps in the training data, and using this probability as the 'count' (rather than '1' as in the observable case), we compute the expected count of the number of times a particular transition was taken, given the training sequence. Furthermore, since the training instances are statistically independent, the value of the expectations can be computed by processing each training instance independently and summing the results.

Similarly for the necessary emission counts (the number of times each symbol in $O$ was generated by each state in $S$), we assume that any state could have generated the observation. We must therefore compute the probability of being in every state at each time point, which is then the size of the emission 'count'. By summing over all time steps we compute the expected count of the number of times that a particular state generated a particular symbol. These two sets of expectations, which are written formally here, are sufficient to execute the M-step.

\[
E[O(q \uparrow o)] = \sum_{i=1}^{|x|} \Pr(y_i = q \mid x; \theta) \cdot \delta(x_i, o)
\tag{6.4}
\]
\[
E[T(q \to r)] = \sum_{i=1}^{|x|-1} \Pr(y_i = q, y_{i+1} = r \mid x; \theta)
\tag{6.5}
\]

Posterior probabilities. The expectations necessary for computing the M-step in HMM training are sums of probabilities that a particular transition is taken, given an observation sequence, and that some state emits some observation symbol, given an observation sequence. These are referred to as posterior probabilities, indicating that they are the probability of some event whose distribution we have a prior belief about, after additional evidence has been taken into consideration (here, the model parameters characterize our prior beliefs, and the observation sequence is the evidence). Both posterior probabilities can be computed by combining the forward probabilities, $\alpha_t(\cdot)$, which give the probability of reaching some state at time $t$, by any path, and generating the observations $\langle x_1, x_2, \ldots, x_t \rangle$, with backward probabilities, $\beta_t(\cdot)$, which give the probability of starting in some state at time $t$ and generating the rest of the sequence $\langle x_{t+1}, x_{t+2}, \ldots, x_{|x|} \rangle$, using any sequence of states to do so. The algorithm for computing the backward probabilities is given a bit later. Once the forward and backward probabilities have been computed, the state transition posterior probabilities and the emission posterior probabilities can be written as follows:

\[
\Pr(y_i = q \mid x; \theta) = \alpha_i(q) \cdot \beta_i(q)
\tag{6.6}
\]
\[
\Pr(y_i = q, y_{i+1} = r \mid x; \theta) = \alpha_i(q) \cdot A_q(r) \cdot B_r(x_{i+1}) \cdot \beta_{i+1}(r)
\tag{6.7}
\]

Strictly speaking, the right-hand sides are joint probabilities of the event and the observation sequence; dividing each by $\Pr(x; \theta) = \sum_{q \in S} \alpha_{|x|}(q)$ yields properly normalized posteriors.

Equation 6.6 is the probability of being in state $q$ at time $i$, given $x$, and the correctness of the expression should be clear from the definitions of forward and backward probabilities. The intuition for Equation 6.7, the probability of taking a particular transition at a particular time, is also not complicated. It is the product of four conditionally independent probabilities: the probability of getting to state $q$ at time $i$ (having generated the first part of the sequence), the probability of taking transition $q \to r$ (which is specified in the parameters, $\theta$), the probability of generating observation $x_{i+1}$ from state $r$ (also specified in $\theta$), and the probability of generating the rest of the sequence, along any path.

[Figure 6.7: Using forward and backward probabilities to compute the posterior probability of the dashed transition, given the observation sequence a b b c b. The shaded area on the left corresponds to the forward probability $\alpha_2(s_2)$, and the shaded area on the right corresponds to the backward probability $\beta_3(s_2)$.]

A visualization of the quantities used in computing this probability is shown in Figure 6.7. In this illustration, we assume an HMM with $S = \{s_1, s_2, s_3\}$ and $O = \{a, b, c\}$.

The backward algorithm. Like the forward and Viterbi algorithms introduced above to answer Questions 1 and 2, the backward algorithm uses dynamic programming to incrementally compute $\beta_t(\cdot)$. Its base case starts at time $|x|$, and is defined as follows:

\[
\beta_{|x|}(q) = 1
\]

To understand the intuition of the base case, keep in mind that since the backward probabilities $\beta_t(\cdot)$ are the probability of generating the remainder of the sequence after time $t$ (as well as being in some state), and since there is nothing left to generate after time $|x|$, the probability must be 1. The recursion is defined as follows:

\[
\beta_t(q) = \sum_{r \in S} \beta_{t+1}(r) \cdot A_q(r) \cdot B_r(x_{t+1})
\]
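The backward recursion and its use in computing state posteriors can be sketched as follows. This is a minimal illustration under stated assumptions: the forward recursion is not restated in this passage, so the standard definition ($\alpha_1(q) = \pi_q B_q(x_1)$, then summing over predecessor states) is assumed; the two-state model and its parameter values are hypothetical; and the posteriors are divided by the sequence likelihood so that they sum to one at each time step.

```python
def forward(x, states, pi, A, B):
    """alpha[t][q]: probability of generating x[0..t] and being in q at t."""
    alpha = [{q: pi[q] * B[q].get(x[0], 0.0) for q in states}]
    for t in range(1, len(x)):
        prev = alpha[-1]
        alpha.append({
            r: sum(prev[q] * A[q].get(r, 0.0) for q in states)
               * B[r].get(x[t], 0.0)
            for r in states
        })
    return alpha


def backward(x, states, A, B):
    """beta[t][q]: probability of generating x[t+1..] given state q at t."""
    beta = [None] * len(x)
    beta[-1] = {q: 1.0 for q in states}      # base case: beta_|x|(q) = 1
    for t in range(len(x) - 2, -1, -1):      # computed from right to left
        beta[t] = {
            q: sum(beta[t + 1][r] * A[q].get(r, 0.0)
                   * B[r].get(x[t + 1], 0.0) for r in states)
            for q in states
        }
    return beta


def state_posteriors(x, states, pi, A, B):
    """Normalized Pr(y_t = q | x) for every time step t and state q."""
    alpha = forward(x, states, pi, A, B)
    beta = backward(x, states, A, B)
    likelihood = sum(alpha[-1].values())     # Pr(x; theta)
    return [{q: alpha[t][q] * beta[t][q] / likelihood for q in states}
            for t in range(len(x))]


# Hypothetical two-state model, used only to exercise the code.
states = ["s1", "s2"]
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}
posteriors = state_posteriors(["a", "b", "b"], states, pi, A, B)
```

A useful sanity check is that $\sum_q \alpha_t(q) \cdot \beta_t(q)$ is the same at every time step and equals the likelihood of the whole sequence, which is why a single normalizer suffices.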

Unlike the forward and Viterbi algorithms, the backward algorithm is computed from right to left and makes no reference to the start probabilities, $\pi$.

6.2.5 FORWARD-BACKWARD TRAINING: SUMMARY

In the preceding section, we showed how to compute all quantities needed to find the parameter settings $\theta^{(i+1)}$ using EM training with a hidden Markov model $M = \langle S, O, \theta^{(i)} \rangle$. To recap: each training instance $x$ is processed independently, using the parameter settings of the current iteration, $\theta^{(i)}$. For each $x$ in the training data, the forward and backward probabilities are computed using the algorithms given above (for this reason, this training algorithm is often referred to as the forward-backward algorithm). The forward and backward probabilities are in turn used to compute the expected number of times the underlying Markov process enters into each state, the number of times each state generates each output symbol type, and the number of times each state transitions into each other state. These expectations are summed over all training instances, completing the E-step. The M-step involves normalizing the expected counts computed in the E-step using the calculations in Equation 6.3, which yields $\theta^{(i+1)}$. The process then repeats from the E-step using the new parameters. The number of iterations required for convergence depends on the quality of the initial parameters and the complexity of the model. For some applications, only a handful of iterations are necessary, whereas for others, hundreds may be required.

Finally, a few practical considerations: HMMs have a non-convex likelihood surface (meaning that it has the equivalent of many hills and valleys in the number of dimensions corresponding to the number of parameters in the model). EM training is only guaranteed to find a local maximum, and the quality of the learned model may vary considerably, depending on the initial parameters that are used. Strategies for optimal selection of initial parameters depend on the phenomena being modeled. Additionally, if some parameter is assigned a probability of 0 (either as an initial value or during one of the M-step parameter updates), EM will never change this in future iterations. This can be useful, since it provides a way of constraining the structures of the Markov model; however, one must be aware of this behavior.

Another pitfall to avoid when implementing HMMs is arithmetic underflow. HMMs typically define a massive number of sequences, and so the probability of any one of them is often vanishingly small, so small that they often underflow standard floating point representations. A very common solution to this problem is to represent probabilities using their logarithms. Note that expected counts do not typically have this problem and can be represented using normal floating point numbers. See Section 5.4 for additional discussion on working with log probabilities.

6.3 EM IN MAPREDUCE

Expectation maximization algorithms fit quite naturally into the MapReduce programming model. Although the model being optimized determines the details of the required computations, MapReduce implementations of EM algorithms share a number of characteristics:

• Each iteration of EM is one MapReduce job.
• A controlling process (i.e., driver program) spawns the MapReduce jobs and keeps track of the number of iterations and convergence criteria.
• Model parameters $\theta^{(i)}$, which are static for the duration of the MapReduce job, are loaded by each mapper from HDFS or other data provider (e.g., a distributed key-value store).
• Mappers map over independent training instances, computing partial latent variable posteriors (or summary statistics, such as expected counts).
• Reducers sum together the required training statistics and solve one or more of the M-step optimization problems.
• Combiners, which sum together the training statistics, are often quite effective at reducing the amount of data that must be shuffled across the network.

The degree of parallelization that can be attained depends on the statistical independence assumed in the model and in the derived quantities required to solve the optimization problems in the M-step. Since parameters are estimated from a collection of samples that are assumed to be i.i.d., the E-step can generally be parallelized effectively since every training instance can be processed independently of the others. In the limit, in fact, each independent training instance could be processed by a separate mapper!^10

Reducers, however, must aggregate the statistics necessary to solve the optimization problems as required by the model. The degree to which these may be solved independently depends on the structure of the model, and this constrains the number of reducers that may be used. Fortunately, many common models (such as HMMs) require solving several independent optimization problems in the M-step. In this situation, a number of reducers may be run in parallel. Still, it is possible that in the worst case, the M-step optimization problem will not decompose into independent subproblems, making it necessary to use a single reducer.

6.3.1 HMM TRAINING IN MAPREDUCE

As we would expect, the training of hidden Markov models parallelizes well in MapReduce. The process can be summarized as follows: in each iteration, mappers process training instances, emitting expected event counts computed using the forward-backward algorithm introduced in Section 6.2.4. Reducers aggregate the expected counts, completing the E-step, and then generate parameter estimates for the next iteration using the updates given in Equation 6.3.

This parallelization strategy is effective for several reasons. First, the majority of the computational effort in HMM training is the running of the forward and backward algorithms. Since there is no limit on the number of mappers that may be run, the full computational resources of a cluster may be brought to bear to solve this problem.

^10 Although the wisdom of doing this is questionable, given that the startup costs associated with individual map tasks in Hadoop may be considerable.

Second, since the M-step of an HMM training iteration with $|S|$ states in the model consists of $2 \cdot |S| + 1$ independent optimization problems that require non-overlapping sets of statistics, this may be exploited with as many as $2 \cdot |S| + 1$ reducers running in parallel. While the optimization problem is computationally trivial, being able to reduce in parallel helps avoid the data bottleneck that would limit performance if only a single reducer were used.

The statistics that are required to solve the M-step optimization problem are quite similar to the relative frequency estimation example discussed in Section 3.3; however, rather than counts of observed events, we aggregate expected counts of events. As a result of the similarity, we can employ the stripes representation for aggregating sets of related values, as described in Section 3.2. A pairs approach that requires less memory at the cost of slower performance is also feasible.

HMM training mapper. The pseudo-code for the HMM training mapper is given in Figure 6.8. The input consists of key-value pairs with a unique id as the key and a training instance (e.g., a sentence) as the value. For each training instance, $2 \cdot |S| + 1$ stripes are emitted with unique keys, and every training instance emits the same set of keys. Each unique key corresponds to one of the independent optimization problems that will be solved in the M-step. The outputs are:

1. the probabilities that the unobserved Markov process begins in each state $q$, with a unique key designating that the values are initial state counts;
2. the expected number of times that state $q$ generated each emission symbol $o$ (the set of emission symbols included will be just those found in each training instance $x$), with a key indicating that the associated value is a set of emission counts from state $q$; and
3. the expected number of times state $q$ transitions to each state $r$, with a key indicating that the associated value is a set of transition counts from state $q$.

HMM training reducer. The reducer for one iteration of HMM training, shown together with an optional combiner in Figure 6.9, aggregates the count collections associated with each key by summing them. When the values for each key have been completely aggregated, the associative array contains all of the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameter settings for the following iteration are computed simply by computing the relative frequency of each event with respect to its expected count at the current iteration. The newly computed parameters are emitted from the reducer and written to HDFS. Note that they will be spread across $2 \cdot |S| + 1$ keys, representing initial state probabilities $\pi$, transition probabilities $A_q$ for each state $q$, and emission probabilities $B_q$ for each state $q$.

    class Mapper
      method Initialize(integer iteration)
        ⟨S, O⟩ ← ReadModel
        θ ← ⟨A, B, π⟩ ← ReadModelParams(iteration)
      method Map(sample id, sequence x)
        α ← Forward(x, θ)                                  ▷ cf. Section 6.2.2
        β ← Backward(x, θ)                                 ▷ cf. Section 6.2.4
        I ← new AssociativeArray                           ▷ Initial state expectations
        for all q ∈ S do                                   ▷ Loop over states
          I{q} ← α_1(q) · β_1(q)
        O ← new AssociativeArray of AssociativeArray       ▷ Emissions
        for t = 1 to |x| do                                ▷ Loop over observations
          for all q ∈ S do                                 ▷ Loop over states
            O{q}{x_t} ← O{q}{x_t} + α_t(q) · β_t(q)
        T ← new AssociativeArray of AssociativeArray       ▷ Transitions
        for t = 1 to |x| − 1 do                            ▷ Loop over observations
          for all q ∈ S do                                 ▷ Loop over states
            for all r ∈ S do                               ▷ Loop over states
              T{q}{r} ← T{q}{r} + α_t(q) · A_q(r) · B_r(x_{t+1}) · β_{t+1}(r)
        Emit(string 'initial', stripe I)
        for all q ∈ S do                                   ▷ Loop over states
          Emit(string 'emit from ' + q, stripe O{q})
          Emit(string 'transit from ' + q, stripe T{q})

Figure 6.8: Mapper pseudo-code for training hidden Markov models using EM. The mappers map over training instances (i.e., sequences of observations $x_i$) and generate the expected counts of initial states, emissions, and transitions taken to generate the sequence.
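The mapper of Figure 6.8 can be rendered as a plain Python generator that yields (key, stripe) pairs, which is convenient for testing the logic outside of Hadoop. This is a sketch under stated assumptions, not the authors' implementation: `hmm_map` and the helper functions are hypothetical names, the forward recursion is assumed to be the standard one, and the toy parameters are invented. One deliberate difference from the figure is that every expected count is divided by the sequence likelihood, so that counts from different training instances are on the same scale before they are summed.

```python
from collections import defaultdict


def forward(x, S, pi, A, B):
    # alpha[t][q]: probability of emitting x[0..t] and being in state q at t.
    alpha = [{q: pi[q] * B[q].get(x[0], 0.0) for q in S}]
    for t in range(1, len(x)):
        prev = alpha[-1]
        alpha.append({r: sum(prev[q] * A[q].get(r, 0.0) for q in S)
                         * B[r].get(x[t], 0.0) for r in S})
    return alpha


def backward(x, S, A, B):
    # beta[t][q]: probability of emitting x[t+1..] given state q at t.
    beta = [{q: 1.0 for q in S}]
    for t in range(len(x) - 2, -1, -1):
        nxt = beta[0]                       # beta at time t + 1
        beta.insert(0, {q: sum(nxt[r] * A[q].get(r, 0.0)
                               * B[r].get(x[t + 1], 0.0) for r in S)
                        for q in S})
    return beta


def hmm_map(x, S, pi, A, B):
    """Yield the 2|S| + 1 stripes of expected counts for one sequence x."""
    alpha = forward(x, S, pi, A, B)
    beta = backward(x, S, A, B)
    z = sum(alpha[-1].values())             # Pr(x); assumed to be positive

    initial = {q: alpha[0][q] * beta[0][q] / z for q in S}

    emit = {q: defaultdict(float) for q in S}
    for t, symbol in enumerate(x):
        for q in S:
            emit[q][symbol] += alpha[t][q] * beta[t][q] / z

    trans = {q: defaultdict(float) for q in S}
    for t in range(len(x) - 1):
        for q in S:
            for r in S:
                trans[q][r] += (alpha[t][q] * A[q].get(r, 0.0)
                                * B[r].get(x[t + 1], 0.0)
                                * beta[t + 1][r]) / z

    yield "initial", initial
    for q in S:
        yield "emit from " + q, dict(emit[q])
        yield "transit from " + q, dict(trans[q])


# Hypothetical two-state model, used only to exercise the mapper.
S = ["s1", "s2"]
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}
x = ["a", "b", "b"]
stripes = dict(hmm_map(x, S, pi, A, B))
```

With normalized counts, the emission stripes sum to the sequence length and the transition stripes sum to the sequence length minus one, which gives a cheap correctness check for a mapper of this kind.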

    class Combiner
      method Combine(string t, stripes [C_1, C_2, ...])
        C_f ← new AssociativeArray
        for all stripe C ∈ stripes [C_1, C_2, ...] do
          Sum(C_f, C)
        Emit(string t, stripe C_f)

    class Reducer
      method Reduce(string t, stripes [C_1, C_2, ...])
        C_f ← new AssociativeArray
        for all stripe C ∈ stripes [C_1, C_2, ...] do
          Sum(C_f, C)
        z ← 0
        for all ⟨k, v⟩ ∈ C_f do
          z ← z + v
        P_f ← new AssociativeArray                         ▷ Final parameters vector
        for all ⟨k, v⟩ ∈ C_f do
          P_f{k} ← v/z
        Emit(string t, stripe P_f)

Figure 6.9: Combiner and reducer pseudo-code for training hidden Markov models using EM. The HMMs considered in this book are fully parameterized by multinomial distributions, so reducers do not require special logic to handle different types of model parameters (since they are all of the same type).
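The combiner and reducer of Figure 6.9 translate almost line for line into Python. The sketch below uses hypothetical function names (`combine`, `reduce_stripes`) and invented counts; it only illustrates the element-wise summation of stripes followed by normalization, not any Hadoop API.

```python
from collections import defaultdict


def combine(key, stripes):
    """Element-wise sum of partial stripes that share the same key."""
    total = defaultdict(float)
    for stripe in stripes:
        for event, count in stripe.items():
            total[event] += count
    return key, dict(total)


def reduce_stripes(key, stripes):
    """Sum the stripes for one key, then normalize into a distribution."""
    _, total = combine(key, stripes)
    z = sum(total.values())
    return key, {event: count / z for event, count in total.items()}


# Invented expected transition counts from two different mappers.
partial = [{"nn": 1.5, "adj": 0.5}, {"nn": 0.5, "adj": 1.5}]
key, distribution = reduce_stripes("transit from det", partial)
```

Because every key holds one multinomial, the same reducer serves the initial-state, transition, and emission stripes alike; in this example the two events end up with probability 0.5 each.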

6.4 CASE STUDY: WORD ALIGNMENT FOR STATISTICAL MACHINE TRANSLATION

To illustrate the real-world benefits of expectation maximization algorithms using MapReduce, we turn to the problem of word alignment, which is an important task in statistical machine translation that is typically solved using models whose parameters are learned with EM.

We begin by giving a brief introduction to statistical machine translation and the phrase-based translation approach; for a more comprehensive introduction, refer to [85, 97]. Fully automated translation has been studied since the earliest days of electronic computers. After successes with code-breaking during World War II, there was considerable optimism that translation of human languages would be another soluble problem. In the early years, work on translation was dominated by manual attempts to encode linguistic knowledge into computers, another instance of the 'rule-based' approach we described in the introduction to this chapter. These early attempts failed to live up to the admittedly overly optimistic expectations. For a number of years, the idea of fully automated translation was viewed with skepticism. Not only was constructing a translation system labor intensive, but translation pairs had to be developed independently, meaning that improvements in a Russian-English translation system could not, for the most part, be leveraged to improve a French-English system.

After languishing for a number of years, the field was reinvigorated in the late 1980s when researchers at IBM pioneered the development of statistical machine translation (SMT), which took a data-driven approach to solving the problem of machine translation, attempting to improve the quality of translation while reducing the cost of developing systems [29]. The core idea of SMT is to equip the computer to learn how to translate, using example translations which are produced for other purposes, and modeling the process as a statistical process with some parameters $\theta$ relating strings in a source language (typically denoted as $f$) to strings in a target language (typically denoted as $e$):

\[
e^{*} = \arg\max_{e} \Pr(e \mid f; \theta)
\]

With the statistical approach, translation systems can be developed cheaply and quickly for any language pair, as long as there is sufficient training data available. Furthermore, improvements in learning algorithms and statistical modeling can yield benefits in many translation pairs at once, rather than being specific to individual language pairs. Thus, SMT, like many other topics we are considering in this book, is an attempt to leverage the vast quantities of textual data that is available to solve problems that would otherwise require considerable manual effort to encode specialized knowledge. Since the advent of statistical approaches to translation, the field has grown tremendously and numerous statistical models of translation have been developed, with many incorporating quite specialized knowledge about the behavior of natural language as biases in their learning algorithms.

6.4.1 STATISTICAL PHRASE-BASED TRANSLATION

One approach to statistical translation that is simple yet powerful is called phrase-based translation [86]. We provide a rough outline of the process since it is representative of most state-of-the-art statistical translation systems, such as the one used inside Google Translate.^11 Phrase-based translation works by learning how strings of words, called phrases, translate between languages.^12 Example phrase pairs for Spanish-English translation might include (los estudiantes, the students), (los estudiantes, some students), and (soy, i am). From a few hundred thousand sentences of example translations, many millions of such phrase pairs may be automatically learned.

The starting point is typically a parallel corpus (also called bitext), which contains pairs of sentences in two languages that are translations of each other. Parallel corpora are frequently generated as the by-product of an organization's effort to disseminate information in multiple languages, for example, proceedings of the Canadian Parliament in French and English, and text generated by the United Nations in many different languages. The parallel corpus is then annotated with word alignments, which indicate which words in one language correspond to words in the other. By using these word alignments as a skeleton, phrases can be extracted from the sentence pairs that are likely to preserve the meaning relationships represented by the word alignment. While an explanation of the process is not necessary here, we mention it as a motivation for learning word alignments, which we show below how to compute with EM. After phrase extraction, each phrase pair is associated with a number of scores which, taken together, are used to compute the phrase translation probability, a conditional probability that reflects how likely the source phrase translates into the target phrase. We briefly note that although EM could be utilized to learn the phrase translation probabilities, this is not typically done in practice since the maximum likelihood solution turns out to be quite bad for this problem. The collection of phrase pairs and their scores is referred to as the translation model. In addition to the translation model, phrase-based translation depends on a language model, which gives the probability of a string in the target language. The translation model attempts to preserve the meaning of the source language during the translation process, while the language model ensures that the output is fluent and grammatical in the target language. The phrase-based translation process is summarized in Figure 6.10.

A language model gives the probability that a string of words $w = \langle w_1, w_2, \ldots, w_n \rangle$, written as $w_1^n$ for short, is a string in the target language. By the chain rule of probability, we get:

\[
\Pr(w_1^n) = \Pr(w_1) \Pr(w_2 \mid w_1) \Pr(w_3 \mid w_1^2) \cdots \Pr(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} \Pr(w_k \mid w_1^{k-1})
\tag{6.8}
\]

Due to the extremely large number of parameters involved in estimating such a model directly, it is customary to make the Markov assumption, that the sequence histories only depend on prior local context.

^11 http://translate.google.com
^12 Phrases are simply sequences of words; they are not required to correspond to the definition of a phrase in any linguistic theory.

[Figure 6.10: The standard phrase-based machine translation architecture. The translation model is constructed with phrases extracted from a word-aligned parallel corpus. The language model is estimated from a monolingual corpus. Both serve as input to the decoder, which performs the actual translation. Diagram labels: training data (parallel sentences) → word alignment → phrase extraction → translation model; target-language text → language model; foreign input sentence → decoder → English output sentence.]

[Figure 6.11: Translation coverage of the sentence Maria no dio una bofetada a la bruja verde by a phrase-based model. The best possible translation path is indicated with a dashed line.]

For higher-order models (e. could a simpler smoothing strategy.4. k−1 we can approximate P (wk |w1 ) as follows: k−1 bigrams: P (wk |w1 ) ≈ P (wk |wk−1 ) k−1 trigrams: P (wk |w1 ) ≈ P (wk |wk−1 wk−2 ) k−1 k−1 n-grams: P (wk |w1 ) ≈ P (wk |wk−n+1 ) (6. where V is the number of words in the vocabulary. introduced a technique known as “stupid backoff ” that was exceedingly simple and so naïve that the resulting model didn’t even deﬁne a valid probability distribution (it 13The phrases may not necessarily be selected in a strict left-to-right order. training n-gram language models still requires estimating an enormous number of parameters: potentially V n . which all share the basic idea of moving probability mass from observed to unseen events in a principled manner. To cope with this sparseness. Brants et al. Being able to vary the order of the phrases used is necessary since languages may express the same ideas using different word orders.13 The decoder seeks to ﬁnd the translation that maximizes the product of the translation probabilities of the phrases used and the language model probability of the resulting string in the target language. Thus. For many applications. That is. the number of parameters can easily exceed the number of words from which to estimate those parameters. researchers have developed a number of smoothing techniques [102].e.. we refer the reader to a recent textbook by Koehn [85]. . For a detailed introduction to phrase-based decoding.2 BRIEF DIGRESSION: LANGUAGE MODELING WITH MAPREDUCE Statistical machine translation provides the context for a brief digression on distributed parameter estimation for language models using MapReduce. the web). Even after making the Markov assumption. CASE STUDY: WORD ALIGNMENT FOR STATISTICAL MACHINE TRANSLATION 133 context. most n-grams will never be observed in a corpus. no matter how large. 5-grams) used in real-world applications.11.6. In fact.9) (6. 
an n-gram language model is equivalent to a (n − 1)th-order Markov model. and provides another example illustrating the effectiveness of data-driven approaches in general.4. A sequence of phrase pairs is selected such that each word in f is translated exactly once. Because the phrase translation probabilities are independent of each other and the Markov assumption made in the language model. as an example illustrates in Figure 6. Since only target language text is necessary (without any additional annotation).11) n The probabilities used in computing Pr(w1 ) based on an n-gram language model are generally estimated from a monolingual corpus of target language text. Brants et al. [25] reported experimental results that answered an interesting question: given the availability of large corpora (i.g. 6. To translate an input sentence f. this may be done efﬁciently using dynamic programming. We brieﬂy touched upon this work in Chapter 1. the phrase-based decoder creates a matrix of all translation possibilities of all substrings in the input string.10) (6.. applied to more text. beat Kneser-Ney in a machine translation task? It should come as no surprise that the answer is yes. language modeling has been well served by largedata approaches that take advantage of the vast quantities of text available on the web. a state-of-the-art approach is known as Kneser-Ney smoothing [35]. In 2007.
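To make the idea of stupid backoff concrete, the following is a minimal single-machine sketch. It is not taken from this text: the scoring rule (relative frequency when the n-gram has been seen, otherwise a fixed penalty times the score under a shorter context) and the backoff factor of 0.4 follow the description published by Brants et al., and the function names and toy corpus are assumptions made for the example.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor reported by Brants et al.; not given in this text


def count_ngrams(tokens, max_order):
    """Count all n-grams up to max_order (in MapReduce, a word-count job)."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(word, context, counts, total_tokens):
    """Score `word` after the tuple `context`; a score, not a probability."""
    ngram = context + (word,)
    if len(ngram) == 1:
        return counts[ngram] / total_tokens      # unigram relative frequency
    if counts[ngram] > 0:
        return counts[ngram] / counts[context]   # seen: relative frequency
    # Unseen: drop the oldest context word and apply the fixed penalty.
    return ALPHA * stupid_backoff(word, context[1:], counts, total_tokens)


tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, max_order=3)
seen = stupid_backoff("cat", ("the",), counts, len(tokens))
unseen = stupid_backoff("sat", ("the",), counts, len(tokens))
```

Because no mass is subtracted from seen events to pay for the backed-off scores, the values for a given context need not sum to one, which is the sense in which the model does not define a valid distribution. The only statistics required are raw n-gram counts.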

The simplicity, however, afforded an extremely scalable implementation in MapReduce. With smaller corpora, stupid backoff didn't work as well as Kneser-Ney in generating accurate and fluent translations. However, as the amount of data increased, the gap between stupid backoff and Kneser-Ney narrowed, and eventually disappeared with sufficient data. Furthermore, with stupid backoff it was possible to train a language model on more data than was feasible with Kneser-Ney smoothing. Applying this language model to a machine translation task yielded better results than a (smaller) language model trained with Kneser-Ney smoothing.

The role of the language model in statistical machine translation is to select fluent, grammatical translations from a large hypothesis space: the more training data a language model has access to, the better its description of relevant language phenomena and hence its ability to select good translations. Once again, large data triumphs! For more information about estimating language models using MapReduce, we refer the reader to a forthcoming book from Morgan & Claypool [26].

6.4.3 WORD ALIGNMENT

Word alignments, which are necessary for building phrase-based translation models (as well as many other more sophisticated translation models), can be learned automatically using EM. In this section, we introduce a popular alignment model based on HMMs.

In the statistical model of word alignment considered here, the observable variables are the words in the source and target sentences (conventionally written using the variables $f$ and $e$, respectively), and their alignment is the latent variable. To make this model tractable, it is assumed that words are translated independently of one another, which means that the model's parameters include the probability of any word in the source language translating to any word in the target language. While this independence assumption is problematic in many ways, it results in a simple model structure that admits efficient inference yet produces reasonable alignments. Alignment models that make this assumption generate a string $e$ in the target language by selecting words in the source language according to a lexical translation distribution. The indices of the words in $f$ used to generate each word in $e$ are stored in an alignment variable, $a$.^14 This means that the variable $a_i$ indicates the source word position of the $i$th target word generated, and $|a| = |e|$. Using these assumptions, the probability of an alignment and translation can be written as follows:

\[
\Pr(e, a \mid f) = \underbrace{\Pr(a \mid f, e)}_{\text{Alignment probability}} \times \underbrace{\prod_{i=1}^{|e|} \Pr(e_i \mid f_{a_i})}_{\text{Lexical probability}}
\]

Since we have parallel corpora consisting of only $\langle f, e \rangle$ pairs, we can learn the parameters for this model using EM and treating $a$ as a latent variable. However, to combat data sparsity in the alignment probability, we must make some further simplifying assumptions.

^14 In the original presentation of statistical lexical translation models, a special null word is added to the source sentences, which permits words to be inserted 'out of nowhere'. Since this does not change any of the important details of training, we omit it from our presentation for simplicity.

the alignment process is quite slow as the size of the training data grows— .4. Both axes in the ﬁgure are on a log scale. a|f) = Pr(ei |fai ) i=1 This model is known as IBM Model 1.12 shows the performance of Giza++ in terms of the running time of a single EM iteration for both Model 1 and the HMM alignment model as a function of the number of training pairs.There are three things to note. and our model acquires the structure of an HMM [150]: |e| |e| Pr(e. We compared the training time of Giza++ and our aligner on a Hadoop cluster with 19 slave nodes.4. Figure 6.6. the running time scales linearly with the size of the training data. 6.4 EXPERIMENTS How well does a MapReduce word aligner for statistical machine translation perform? We describe previously published results [54] that compared a Java-based Hadoop implementation against a highly optimized word aligner called Giza++ [112]. Second. and therefore EM will learn the same solution regardless of initialization. To properly initialize this HMM. summing over all settings of a. First. CASE STUDY: WORD ALIGNMENT FOR STATISTICAL MACHINE TRANSLATION 135 probability. while the forwardbackward algorithm could be used to compute the expected counts necessary for training this model by setting Aq (r) to be a constant value for all q and r. Third. and the best alignment for a sentence pair can be found using the Viterbi algorithm. but the ticks on the y-axis are aligned with ‘meaningful’ time intervals rather than exact orders of magnitude. the uniformity assumption means that the expected emission counts can be estimated in time O(|e| · |f|). which was written in C++ and designed to run efﬁciently on a single core. words that are nearby in the source language will tend to be nearby in the target language). the HMM is a constant factor slower than Model 1. and use this simpler model to learn initial lexical translation (emission) parameters for the HMM. 
6.4.4 EXPERIMENTS

How well does a MapReduce word aligner for statistical machine translation perform? We describe previously published results [54] that compared a Java-based Hadoop implementation against a highly optimized word aligner called Giza++ [112], which was written in C++ and designed to run efficiently on a single core. We compared the training time of Giza++ and our aligner on a Hadoop cluster with 19 slave nodes, each with two single-core processors and two disks (38 cores total).

Figure 6.12 shows the performance of Giza++ in terms of the running time of a single EM iteration for both Model 1 and the HMM alignment model as a function of the number of training pairs. Both axes in the figure are on a log scale, but the ticks on the y-axis are aligned with 'meaningful' time intervals rather than exact orders of magnitude. There are three things to note. First, the running time scales linearly with the size of the training data. Second, the HMM is a constant factor slower than Model 1. Third, the alignment process is quite slow as the size of the training data grows.

At one million sentences, a single iteration takes over three hours to complete! Five iterations are generally necessary to train the models, which means that full training takes the better part of a day.

In Figure 6.13 we plot the running time of our MapReduce implementation running on the 38-core cluster described above. For reference, we plot points indicating what 1/38 of the running time of the Giza++ iterations would be at each data size, which gives a rough indication of what an 'ideal' parallelization could achieve, assuming that there was no overhead associated with distributing computation across these machines. Three things may be observed in the results. First, as the amount of data increases, the relative cost of the overhead associated with distributing data, marshaling and aggregating counts, decreases. At one million sentence pairs of training data, the HMM alignment iterations begin to approach optimal runtime efficiency. Second, Model 1, which we observe is light on computation, does not approach the theoretical performance of an ideal parallelization, and in fact, has almost the same running time as the HMM alignment algorithm. We conclude that the overhead associated with distributing and aggregating data is significant compared to the Model 1 computations, although a comparison with Figure 6.12 indicates that the MapReduce implementation is still substantially faster than the single-core implementation, at least once a certain training data size is reached. Finally, we note that, in comparison to the running times of the single-core implementation, at large data sizes, there is a significant advantage to using the distributed implementation, even of Model 1.

Although these results do confound several variables (Java vs. C++ performance, memory usage patterns), it is reasonable to expect that the confounds would tend to make the single-core system's performance appear relatively better than the MapReduce system (which is, of course, the opposite pattern from what we actually observe). Furthermore, these results show that when computation is distributed over a cluster of many machines, even an unsophisticated implementation of the HMM aligner could compete favorably with a highly optimized single-core system whose performance is well-known to many people in the MT research community.

Why are these results important? Perhaps the most significant reason is that the quantity of parallel data that is available to train statistical machine translation models is ever increasing, and as is the case with so many problems we have encountered, more data lead to improvements in translation quality [54]. Recently, a corpus of one billion words of French-English data was mined automatically from the web and released publicly [33].¹⁵ Single-core solutions to model construction simply cannot keep pace with the amount of translated data that is constantly being produced. Fortunately, several independent researchers have shown that existing modeling algorithms can be expressed naturally and effectively using MapReduce, which means that we can take advantage of this data. Furthermore, the results presented here show that even at data sizes that may be tractable on single machines, significant performance improvements are attainable using MapReduce implementations. This improvement reduces experimental turnaround times, which allows researchers to more quickly explore the solution space, which will, we hope, lead to rapid new developments in statistical machine translation.

¹⁵ http://www.statmt.org/wmt10/translation-task.html

Figure 6.12: Running times of Giza++ (baseline single-core system) for Model 1 and HMM training iterations at various corpus sizes. [Plot: average iteration latency (log scale, 3 s to 3 hrs) against corpus size in sentences (10,000 to 1,000,000); one curve each for Model 1 and HMM.]

Figure 6.13: Running times of our MapReduce implementation of Model 1 and HMM training iterations at various corpus sizes. For reference, 1/38 running times of the Giza++ models are shown. [Plot: time (log scale, 3 s to 3 hrs) against corpus size in sentences (10,000 to 1,000,000); curves for Optimal Model 1 (Giza/38), Optimal HMM (Giza/38), MapReduce Model 1 (38 M/R), and MapReduce HMM (38 M/R).]

For the reader interested in statistical machine translation, there is an open source Hadoop-based MapReduce implementation of a training pipeline for phrase-based translation that includes word alignment, phrase extraction, and phrase scoring [56].

6.5 EM-LIKE ALGORITHMS

This chapter has focused on expectation maximization algorithms and their implementation in the MapReduce programming framework. These important algorithms are indispensable for learning models with latent structure from unannotated data, and they can be implemented quite naturally in MapReduce. We now explore some related learning algorithms that are similar to EM but can be used to solve more general problems, and discuss their implementation.

In this section we focus on gradient-based optimization, which refers to a class of techniques used to optimize any objective function, provided it is differentiable with respect to the parameters being optimized. Gradient-based optimization is particularly useful in the learning of maximum entropy (maxent) models [110] and conditional random fields (CRF) [87] that have an exponential form and are trained to maximize conditional likelihood. In addition to being widely used supervised classification models in text processing (meaning that during training, both the data and their annotations must be observable), their gradients take the form of expectations. As a result, some of the previously introduced techniques are also applicable for optimizing these models.

6.5.1 GRADIENT-BASED OPTIMIZATION AND LOG-LINEAR MODELS

Gradient-based optimization refers to a class of iterative optimization algorithms that use the derivatives of a function to find the parameters that yield a minimal or maximal value of that function. Obviously, these algorithms are only applicable in cases where a useful objective exists, is differentiable, and its derivatives can be efficiently evaluated. Fortunately, this is the case for many important problems of interest in text processing. For the purposes of this discussion, we will give examples in terms of minimizing functions.

Assume that we have some real-valued function $F(\theta)$ where $\theta$ is a $k$-dimensional vector and that $F$ is differentiable with respect to $\theta$. Its gradient is defined as:

$$\nabla F(\theta) = \left\langle \frac{\partial F}{\partial \theta_1}(\theta), \frac{\partial F}{\partial \theta_2}(\theta), \ldots, \frac{\partial F}{\partial \theta_k}(\theta) \right\rangle$$

The gradient has two crucial properties that are exploited in gradient-based optimization. First, the gradient $\nabla F$ is a vector field that points in the direction of the greatest increase of $F$ and whose magnitude indicates the rate of increase. Second, if $\theta^*$ is a (local) minimum of $F$, then the following is true:

$$\nabla F(\theta^*) = 0$$

An extremely simple gradient-based minimization algorithm produces a series of parameter estimates $\theta^{(1)}, \theta^{(2)}, \ldots$ by starting with some initial parameter settings $\theta^{(1)}$ and updating parameters through successive iterations according to the following rule:

$$\theta^{(i+1)} = \theta^{(i)} - \eta^{(i)} \nabla F(\theta^{(i)}) \tag{6.12}$$

The parameter $\eta^{(i)} > 0$ is a learning rate which indicates how quickly the algorithm moves along the gradient during iteration $i$. Provided this value is small enough that $F$ decreases, this strategy will find a local minimum of $F$. However, while simple, this update strategy may converge slowly, and proper selection of $\eta$ is non-trivial. More sophisticated algorithms perform updates that are informed by approximations of the second derivative, which are estimated by successive evaluations of $\nabla F(\theta)$, and can converge much more rapidly [96].

Gradient-based optimization in MapReduce. Gradient-based optimization algorithms can often be implemented effectively in MapReduce. Like EM, where the structure of the model determines the specifics of the realization, the details of the function being optimized determine how it should best be implemented, and not every function optimization problem will be a good fit for MapReduce. Nevertheless, MapReduce implementations of gradient-based optimization tend to have the following characteristics:

• Each optimization iteration is one MapReduce job.

• The objective should decompose linearly across training instances. This implies that the gradient also decomposes linearly, and therefore mappers can process input data in parallel. The values they emit are pairs $\langle F(\theta), \nabla F(\theta) \rangle$, which are linear components of the objective and gradient.

• Evaluations of the function and its gradient are often computationally expensive because they require processing lots of data. This makes parallelization with MapReduce worthwhile.

• Whether more than one reducer can run in parallel depends on the specific optimization algorithm being used. Some, like the trivial algorithm of Equation 6.12, treat the dimensions of $\theta$ independently, whereas many are sensitive to global properties of $\nabla F(\theta)$. In the latter case, parallelization across multiple reducers is non-trivial.

• Reducer(s) sum the component objective/gradient pairs, compute the total objective and gradient, run the optimization algorithm, and emit $\theta^{(i+1)}$.

• Many optimization algorithms are stateful and must persist their state between optimization iterations. This may either be emitted together with $\theta^{(i+1)}$ or written to the distributed file system as a side effect of the reducer. Such external side effects must be handled carefully; refer to Section 2.2 for a discussion.
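The characteristics listed above can be sketched in a few lines of plain Python that simulate one MapReduce job per iteration. This is a minimal illustration under stated assumptions, not Hadoop code: the objective is a toy least-squares loss chosen because it decomposes across instances, each "shard" stands in for one mapper's input split, and the names `map_shard`, `reduce_and_update`, and `optimize` are hypothetical.

```python
def map_shard(theta, shard):
    """Mapper: partial objective and gradient of squared error on one shard."""
    obj, grad = 0.0, [0.0] * len(theta)
    for x, y in shard:
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        obj += 0.5 * err * err
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return obj, grad

def reduce_and_update(theta, partials, eta):
    """Reducer: sum the partial pairs, then apply the rule of Equation 6.12."""
    total_obj = sum(obj for obj, _ in partials)
    total_grad = [sum(g[j] for _, g in partials) for j in range(len(theta))]
    new_theta = [t - eta * g for t, g in zip(theta, total_grad)]
    return total_obj, new_theta

def optimize(shards, theta, eta=0.05, iterations=200):
    """Driver: each loop iteration plays the role of one MapReduce job."""
    history = []
    for _ in range(iterations):
        partials = [map_shard(theta, shard) for shard in shards]
        obj, theta = reduce_and_update(theta, partials, eta)
        history.append(obj)
    return theta, history
```

A single reducer suffices here because the update treats each dimension of the parameter vector independently and keeps no state between iterations; a stateful optimizer would also need to persist that state, as noted in the last bullet.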

Parameter learning for log-linear models. Gradient-based optimization techniques can be quite effectively used to learn the parameters of probabilistic models with a log-linear parameterization [100]. While a comprehensive introduction to these models is beyond the scope of this book, such models are used extensively in text processing applications, and their training using gradient-based optimization, which may otherwise be computationally expensive, can be implemented effectively using MapReduce. We therefore include a brief summary.

Log-linear models are particularly useful for supervised learning (unlike the unsupervised models learned with EM), where an annotation $y \in \mathcal{Y}$ is available for every $x \in \mathcal{X}$ in the training data. In this case, it is possible to directly model the conditional distribution of label given input:

$$\Pr(y \mid x; \theta) = \frac{\exp \sum_i \theta_i \cdot H_i(x, y)}{\sum_{y'} \exp \sum_i \theta_i \cdot H_i(x, y')}$$

In this expression, $H_i$ are real-valued functions sensitive to features of the input and labeling. The parameters of the model are selected so as to minimize the negative conditional log likelihood of a set of training instances $\langle \langle x, y \rangle_1, \langle x, y \rangle_2, \ldots \rangle$, which we assume to be i.i.d.:

$$F(\theta) = \sum_{\langle x, y \rangle} -\log \Pr(y \mid x; \theta) \tag{6.13}$$

$$\theta^* = \arg\min_{\theta} F(\theta) \tag{6.14}$$

As Equation 6.13 makes clear, the objective decomposes linearly across training instances, meaning it can be optimized quite well in MapReduce. The derivative of $F$ with respect to $\theta_i$ can be shown to have the following form [141]:¹⁶

$$\frac{\partial F}{\partial \theta_i}(\theta) = \sum_{\langle x, y \rangle} \left[ E_{\Pr(y' \mid x; \theta)}[H_i(x, y')] - H_i(x, y) \right]$$

The expectation term in the gradient's expression can be computed using a variety of techniques. However, as we saw with EM, when very large event spaces are being modeled, as is the case with sequence labeling, enumerating all possible values $y$ can become computationally intractable. And, as was the case with HMMs, independence assumptions can be used to enable efficient computation using dynamic programming. In fact, the forward-backward algorithm introduced in Section 6.2.4 can, with only minimal modification, be used to compute the expectation $E_{\Pr(y' \mid x; \theta)}[H_i(x, y')]$ needed in CRF sequence models, as long as the feature functions respect the same Markov assumption that is made in HMMs. For more information about inference in CRFs using the forward-backward algorithm, we refer the reader to Sha et al. [140].

As we saw in the previous section, MapReduce offers significant speedups when training iterations require running the forward-backward algorithm. The same pattern of results holds when training linear CRFs.

¹⁶ This assumes that when $\langle x, y \rangle$ is present the model is fully observed (i.e., there are no additional latent variables).

6.6 SUMMARY AND ADDITIONAL READINGS

This chapter focused on learning the parameters of statistical models from data, using expectation maximization algorithms or gradient-based optimization techniques. We focused especially on EM algorithms for three reasons. First, these algorithms can be expressed naturally in the MapReduce programming model, making them a good example of how to express a commonly used algorithm in this new framework. Second, many models, such as the widely used hidden Markov model (HMM) trained using EM, make independence assumptions that permit a high degree of parallelism in both the E- and M-steps. Thus, they are particularly well positioned to take advantage of large clusters. Finally, EM algorithms are unsupervised learning algorithms, which means that they have access to far more training data than comparable supervised approaches. This is quite important. In Chapter 1, when we hailed large data as the "rising tide that lifts all boats" to yield more effective algorithms, we were mostly referring to unsupervised approaches, given that the manual effort required to generate annotated data remains a bottleneck in many supervised approaches. Data acquisition for unsupervised algorithms is often as simple as crawling specific web sources, given the enormous quantities of data available "for free". This, combined with the ability of MapReduce to process large datasets in parallel, provides researchers with an effective strategy for developing increasingly effective applications.

Since EM algorithms are relatively computationally expensive, even for small amounts of data, this led us to consider how related supervised learning models (which typically have much less training data available) can also be implemented in MapReduce. The discussion demonstrates that not only does MapReduce provide a means for coping with ever-increasing amounts of data, but it is also useful for parallelizing expensive computations. Although MapReduce has been designed with mostly data-intensive applications in mind, the ability to leverage clusters of commodity hardware to parallelize computationally expensive algorithms is an important use case.

Additional Readings. Because of its ability to leverage large amounts of training data, machine learning is an attractive problem for MapReduce and an area of active research. Chu et al. [37] presented general formulations of a variety of machine learning problems, focusing on a normal form for expressing a variety of machine learning algorithms in MapReduce. The Apache Mahout project is an open-source implementation of these and other learning algorithms,¹⁷ and it is also the subject of a forthcoming book [116]. Issues associated with a MapReduce implementation of latent Dirichlet allocation (LDA), which is another important unsupervised learning technique, with certain similarities to EM, have been explored by Wang et al. [151].

¹⁷ http://lucene.apache.org/mahout/


CHAPTER 7

Closing Remarks

The need to process enormous quantities of data has never been greater. Not only are terabyte- and petabyte-scale datasets rapidly becoming commonplace, but there is consensus that great value lies buried in them, waiting to be unlocked by the right computational tools. In the commercial sphere, business intelligence, driven by the ability to gather data from a dizzying array of sources, promises to help organizations better understand their customers and the marketplace, hopefully leading to better business decisions and competitive advantages. For engineers building information processing tools and applications, larger datasets lead to more effective algorithms for a wide range of tasks, from machine translation to spam detection. In the natural and physical sciences, the ability to analyze massive amounts of data may provide the key to unlocking the secrets of the cosmos or the mysteries of life.

In the preceding chapters, we have shown how MapReduce can be exploited to solve a variety of problems related to text processing at scales that would have been unthinkable a few years ago. However, no tool, no matter how powerful or flexible, can be perfectly adapted to every task, so it is only fair to discuss the limitations of the MapReduce programming model and survey alternatives. Section 7.1 covers online learning algorithms and Monte Carlo simulations, which are examples of algorithms that require maintaining global state. As we have seen, this is difficult to accomplish in MapReduce. Section 7.2 discusses alternative programming models, and the book concludes in Section 7.3.

7.1 LIMITATIONS OF MAPREDUCE

As we have seen throughout this book, solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global synchronization in MapReduce is the barrier between the map and reduce phases of processing).

The first example is online learning. Recall from Chapter 6 the concept of learning as the setting of parameters in a statistical model. Both EM and the gradient-based learning algorithms we described are instances of what are known as batch learning algorithms. This simply means that the full "batch" of training data is processed before any updates to the model parameters are made. On one hand, this is quite reasonable: updates are not made until the full evidence of the training data has been weighed against the model. An earlier update would seem, in some sense, to be hasty.

However, it is generally the case that more frequent updates can lead to more rapid convergence of the model (in terms of number of training instances processed), even if those updates are made by considering less data [24]. Thinking in terms of gradient optimization (see Section 6.5), online learning algorithms can be understood as computing an approximation of the true gradient, using only a few training instances. Although only an approximation, the gradient computed from a small subset of training instances is often quite reasonable, and the aggregate behavior of multiple updates tends to even out errors that are made. In the limit, updates can be made after every training instance.

Unfortunately, implementing online learning algorithms in MapReduce is problematic. The model parameters in a learning algorithm can be viewed as shared global state, which must be updated as the model is evaluated against training data. All processes performing the evaluation (presumably the mappers) must have access to this state. In a batch learner, where updates occur in one or more reducers (or, alternatively, in the driver code), synchronization of this resource is enforced by the MapReduce framework. However, with online learning, these updates must occur after processing smaller numbers of instances. This means that the framework must be altered to support faster processing of smaller datasets, which goes against the design choices of most existing MapReduce implementations. Since MapReduce was specifically optimized for batch operations over large amounts of data, such a style of computation would likely result in inefficient use of resources. In Hadoop, for example, map and reduce tasks have considerable startup costs. This is acceptable because in most circumstances, this cost is amortized over the processing of many key-value pairs. However, for small datasets, these high startup costs become intolerable. An alternative is to abandon shared global state and run independent instances of the training algorithm in parallel (on different portions of the data). A final solution is then arrived at by merging individual results. Experiments, however, show that the merged solution is inferior to the output of running the training algorithm on the entire dataset [52].
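The contrast with batch learning can be seen in a short sketch of the limiting case, where the parameters are updated after every training instance. This is an illustration under assumptions, not an algorithm from the text: the toy squared-error objective, the fixed learning rate, and the function name `sgd` are all hypothetical choices.

```python
def sgd(data, theta, eta=0.05, epochs=300):
    """Online learning: update the parameters after every training instance.

    Each update uses the gradient of the squared error on a single instance,
    which approximates the full-batch gradient. Note that theta is shared
    state that every subsequent evaluation depends on.
    """
    theta = list(theta)
    for _ in range(epochs):
        for x, y in data:
            err = sum(t * xi for t, xi in zip(theta, x)) - y
            theta = [t - eta * err * xi for t, xi in zip(theta, x)]
    return theta
```

The inner loop reads and writes `theta` on every instance, which is the shared global state that isolated map tasks cannot easily maintain.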
A related difficulty occurs when running what are called Monte Carlo simulations, which are used to perform inference in probabilistic models where evaluating or representing the model exactly is impossible. The basic idea is quite simple: samples are drawn from the random variables in the model to simulate its behavior, and then simple frequency statistics are computed over the samples. This sort of inference is particularly useful when dealing with so-called nonparametric models, which are models whose structure is not specified in advance, but is rather inferred from training data. For an illustration, imagine learning a hidden Markov model, but inferring the number of states, rather than having them specified. Being able to parallelize Monte Carlo simulations would be tremendously valuable, particularly for unsupervised learning applications where they have been found to be far more effective than EM-based learning (which requires specifying the model). Although recent work [10] has shown that the delays in synchronizing sample statistics due to parallel implementations do not necessarily damage the inference, MapReduce offers no natural mechanism for managing the global shared state that would be required for such an implementation.

The problem of global state is sufficiently pervasive that there has been substantial work on solutions. One approach is to build a distributed datastore capable of maintaining the global state.

However, such a system would need to be highly scalable to be used in conjunction with MapReduce. Google's Bigtable [34], which is a sparse, distributed, persistent multidimensional sorted map built on top of GFS, fits the bill, and has been used in exactly this manner. Amazon's Dynamo [48], which is a distributed key-value store (with a very different architecture), might also be useful in this respect, although it wasn't originally designed with such an application in mind. Unfortunately, it is unclear if the open-source implementations of these two systems (HBase and Cassandra, respectively) are sufficiently mature to handle the low-latency and high-throughput demands of maintaining global state in the context of massively distributed processing (but recent benchmarks are encouraging [40]).

7.2 ALTERNATIVE COMPUTING PARADIGMS

Streaming algorithms [3] represent an alternative programming model for dealing with large volumes of data with limited computational and storage resources. This model assumes that data are presented to the algorithm as one or more streams of inputs that are processed in order, and only once. The model is agnostic with respect to the source of these streams, which could be files in a distributed file system, but more interestingly, data from an "external" source or some other data gathering device. Stream processing is very attractive for working with time-series data (news feeds, tweets, sensor readings, etc.), which is difficult in MapReduce (once again, given its batch-oriented design). Furthermore, since streaming algorithms are comparatively simple (because there is only so much that can be done with a particular training instance), they can often take advantage of modern GPUs, which have a large number of (relatively simple) functional units [104]. In the context of text processing, streaming algorithms have been applied to language modeling [90], translation modeling [89], and detecting the first mention of a news event in a stream [121].
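To give a flavor of the streaming model, the sketch below finds frequent items in a single pass while keeping only a bounded number of counters. The technique shown is the Misra-Gries frequent-items summary, chosen here purely as an illustration; it is not one of the algorithms cited in this section, and the function name `misra_gries` is hypothetical.

```python
def misra_gries(stream, k):
    """One pass over the stream, keeping at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to survive in the returned dictionary. The stored counts
    are lower bounds on the true frequencies, not exact counts.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Each input is examined exactly once and in order, and memory use is independent of the stream length, which are the two properties that define the model described above.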
The idea of stream processing has been generalized in the Dryad framework as arbitrary dataflow graphs [75, 159]. A Dryad job is a directed acyclic graph where each vertex represents developer-specified computations and edges represent data channels that capture dependencies. The dataflow graph is a logical computation graph that is automatically mapped onto physical resources by the framework. At runtime, channels are used to transport partial results between vertices, and can be realized using files, TCP pipes, or shared memory.

Another system worth mentioning is Pregel [98], which implements a programming model inspired by Valiant's Bulk Synchronous Parallel (BSP) model [148]. Pregel was specifically designed for large-scale graph algorithms, but unfortunately there are few published details at present. However, a longer description is anticipated in a forthcoming paper [99].

What is the significance of these developments? The power of MapReduce derives from providing an abstraction that allows developers to harness the power of large clusters. As anyone who has taken an introductory computer science course would know, abstractions manage complexity by hiding details and presenting well-defined behaviors to users of those abstractions. This process makes certain tasks easier, but others more difficult, if not impossible. MapReduce is certainly no exception to this generalization, and one of the goals of this book has been to give the reader a better understanding of what's easy to do in MapReduce and what its limitations are.

7. Similarly. CLOSING REMARKS this begs the obvious question: What other abstractions are available in the massively distributed datacenter environment? Are there more appropriate computational models that would allow us to tackle classes of problems that are difﬁcult for MapReduce? Dryad and Pregel are alternative answers to these questions. which was inspired by Google’s Sawzall [122]. constructs in the language allow developers to specify data transformations (ﬁltering. we have been able to economically bring large clusters of machines to bear on problems of interest.) at a much higher level. we can certainly expect the development of new models and a better understanding of existing ones. at least at present. there is a proposal to add a third merge phase after map and reduce to better support relational operations [36]. Although Pig scripts (in a language called Pig Latin) are ultimately converted into Hadoop jobs by Pig’s execution engine. which approach is more appropriate for different classes of applications.They differ. Most important of all. By scaling “out” with commodity servers. Pig [114]. etc. It is merely the ﬁrst of many approaches to harness large-scaled distributed computing resources. Therefore. But this has only been possible with corresponding innovations in software and how computations are organized on a massive scale. can be described as a data analytics platform that provides a lightweight scripting language for manipulating large datasets. we have already observed the development of alternative approaches for expressing distributed computations. . it remains unclear. However. joining. They share in providing an abstraction for large-scale distributed computations. however. and in fact. MapReduce can be trivially implemented in Dryad. is the development of new abstractions that hide system-level details from the application developer. 
however.3 MAPREDUCE AND BEYOND The capabilities necessary to tackle large-data problems are already within reach by many and will continue to become more accessible over time. while simultaneously taking advantage of Hadoop’s data processing capabilities.146 7. Important ideas include: moving processing to the data. also. separating the what from the how of computation and isolating the developer from the details of concurrent programming. avoiding random seeks. For example. another open-source project. provides an abstraction on top of Hadoop that allows users to issue SQL queries against large relational datasets stored in HDFS. Hive [68]. and perhaps not even the best. and provide a model using which programmers can reason about computations at a massive scale without being distracted by ﬁne-grained concurrency management. in how distributed computations are conceptualized: functional-style programming. emphasizing throughput over latency for batch tasks by sequential scans through data. Dryad is more ﬂexible than MapReduce. Even within the Hadoop/MapReduce ecosystem. or BSP. These conceptions represent different trade offs between simplicity and expressivity: for example. Looking forward. the system provides a data analysis tool for users who are already comfortable with relational databases. arbitrary dataﬂows. Hive queries (in HiveQL) “compile down” to Hadoop jobs by the Hive query engine. grouping. MapReduce is not the end. These abstractions are at the level of entire datacenters. as opposed to the other way around.

Such abstractions likewise spare programmers from fault tolerance, error recovery, and a host of other issues in distributed computing. This, in turn, paves the way for innovations in scalable algorithms that can run on petabyte-scale datasets.

None of these points are new or particularly earth-shattering: computer scientists have known about these principles for decades. However, MapReduce is unique in that, for the first time, all these ideas came together and were demonstrated on practical problems at scales unseen before, both in terms of computational resources and the impact on the daily lives of millions. The engineers at Google deserve a tremendous amount of credit for that, and also for sharing their insights with the rest of the world. Furthermore, the engineers and executives at Yahoo deserve a lot of credit for starting the open-source Hadoop project, which has made MapReduce accessible to everyone and created the vibrant software ecosystem that flourishes today. Add to that the advent of utility computing, which eliminates capital investments associated with cluster infrastructure, and large-data processing capabilities are now available "to the masses" with a relatively low barrier to entry.

The golden age of massively distributed computing is finally upon us.


Bibliography

[1] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In Proceedings of the 35th International Conference on Very Large Data Base (VLDB 2009), pages 922–933, Lyon, France, 2009.

[2] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47–97, 2002. DOI: 10.1103/RevModPhys.74.47

[3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing (STOC '96), pages 20–29, Philadelphia, Pennsylvania, 1996. DOI: 10.1145/237814.237823

[4] Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Hellerstein, and Russell C. Sears. BOOM: Data-centric programming in the datacenter. Technical Report UCB/EECS-2009-98, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2009.

[5] Gene Amdahl. Validity of the single processor approach to achieving large-scale computing capabilities. In Proceedings of the AFIPS Spring Joint Computer Conference, pages 483–485, 1967. DOI: 10.1145/1465482.1465560

[6] Rajagopal Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu Pucha, Prasenjit Sarkar, Mansi Shah, and Renu Tewari. Cloud analytics: Do we really need to reinvent the storage stack? In Proceedings of the 2009 Workshop on Hot Topics in Cloud Computing (HotCloud 09), San Diego, California, 2009.

[7] Thomas Anderson, Michael Dahlin, Jeanna Neefe, David Patterson, Drew Roselli, and Randolph Wang. Serverless network file systems. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP 1995), pages 109–126, Copper Mountain Resort, Colorado, 1995. DOI: 10.1145/224056.224066

[8] Vo Ngoc Anh and Alistair Moffat. Inverted index compression using word-aligned binary codes. Information Retrieval, 8(1):151–166, 2005. DOI: 10.1023/B:INRT.0000048490.99518.5c

and Max Welling. Carlos Castillo. Bolton Landing. Carlos Castillo. Flavio Junqueira. Aristides Gionis.1109/MM. and Fabrizio Silvestri. and Vicente López. Vancouver.1145/945445. 2007. pages 183–190. Vassilis Plachouras. The case for energy-proportional computing. British Columbia. Electrical Engineering and Computer Sciences. New York. Ion Stoica. CLOSING REMARKS [9] Michael Armbrust. and Fabrizio Silvestri. pages 6–20. 14 .2200/S00193ED1V01Y200905CAC006 8.443 9 [18] Luiz André Barroso and Urs Hölzle. PageRank increase under different collusion topologies. The impact of caching on search engines. DOI: 10. Vassilis Plachouras. Ariel Rabkin. Gunho Lee. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003).1145/1277741. Andrew Konwinski. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2007). 6 [10] Arthur Asuncion. Challenges on distributed web retrieval. IEEE Micro.1109/MC. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005).1196112 82 [17] Luiz André Barroso and Urs Hölzle. Morgan & Claypool Publishers. The Netherlands. 2007. Rolf Neugebauer. pages 164–177. Rean Grifﬁth. Boris Dragovic. Flavio Junqueira.367846 83 [12] Ricardo Baeza-Yates. and Andrew Warﬁeld. 2007. Joseph.945462 6 [16] Luiz André Barroso. 23(2):22–28. Randy H. Tim Harris. and Matei Zaharia. 2009. Asynchronous distributed learning of topic models. DOI: 10. 2008. Web search for a planet: The Google cluster architecture. 2003. Steven Hand. Technical Report UCB/EECS-200928. Keir Fraser. 10. pages 17–24.1073017 4 [15] Paul Barham. Katz. In Advances in Neural Information Processing Systems 21 (NIPS 2008). Japan. Xen and the art of virtualization. DOI: 10. DOI: 10. Jeffrey Dean. pages 81–88. Canada. Above the clouds: A Berkeley view of cloud computing. Amsterdam. 
In Proceedings of the IEEE 23rd International Conference on Data Engineering (ICDE 2007). The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Toulouse. Chiba. 2005. DOI: 10. Vanessa Murdock. Istanbul. 2009.150 7. Patterson. 9. 2001. 2003.1277775 82 [14] Michele Banko and Eric Brill. University of California at Berkeley. Padhraic Smyth. DOI: 10.3115/1073012. 144 [11] Ricardo Baeza-Yates. and Urs Hölzle. Turkey.2007. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001). Computer. 96 [13] Ricardo Baeza-Yates. Ian Pratt. 40(12):33–37.2007. Anthony D. pages 26–33.2003. DOI: 10. Alex Ho. France. David A.1109/ICDE. Armando Fox.

Czech Republic. Morgan & Claypool Publishers. Charles L. 5(1):92–128. 2005. pages 858–867. Beyond the data deluge. Popat. 2010. Large language models in machine translation. 5 [24] Léon Bottou. 1993. Gaithersburg.3. editors. Collected Fictions (translated by Andrew Hurley). Mercer.1007/b100712 144 [25] Thorsten Brants.7. Jimmy Lin. Ani Thakar. Andrew Hanushevsky. Lessons learned from managing a petabyte. Cormack. In Olivier Bousquet and Ulrike von Luxburg. In Proceedings of the Second Biennial Conference on Innovative Data Systems Research (CIDR 2005). Advanced Lectures on Machine Learning. 1999. LNAI 3176. Brown. 4. 2009. 134 [27] Eric Brill. Clarke. 133 [26] Thorsten Brants and Peng Xu. Science. Tony Hey. Massachusetts. Sergei Nikolaev. 323(5919):1297– 1298. 1995. and Gordon V. 2 [20] Jacek Becla and Daniel L. May 2006. pages 146–168. and Jim Gray. Marco Gori. The Mythical Man-Month: Essays on Software Engineering. 2 [21] Gordon Bell. Stephen A. Inside PageRank.1052938 98. Berlin. Michele Banko. 130 [30] Stefan Büttcher. 12 [29] Peter F. Franz J. Prague. Stochastic learning. In Proceedings of the Tenth Text REtrieval Conference (TREC 2001). and Franco Scarselli. Computational Linguistics. Ashok C. Anniversary Edition. Lecture Notes in Artiﬁcial Intelligence. Massachusetts. Designing a multi-petabyte database for LSST. Della Pietra. Cambridge. DOI: 10. 2005. Maryland. Brooks. MAPREDUCE AND BEYOND 151 [19] Jacek Becla. and Jeffrey Dean. Springer Verlag. Penguin. and Alex Szalay. 2007. Susan Dumais. Maria Nieto-Santisteban. A. Alex Szalay. and Robert L. 2 [22] Monica Bianchini. Peng Xu. Reading. SLAC Publications SLAC-PUB-12292. MIT Press. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Data-intensive question answering. Della Pietra. The mathematics of statistical machine translation: Parameter estimation. Wang. 2010. 4 [28] Frederick P. and Andrew Ng. 2001. 5. 2004. 
Ghaleb Abdulla. DOI: 10. ACM Transactions on Internet Technology. 19(2):263–311. 100 [23] Jorge Luis Borges. Asilomar. Information Retrieval: Implementing and Evaluating Search Engines. 78. 83 .1145/1052934. Distributed Language Models. Vincent J. California. Stanford Linear Accelerator Center. Och. pages 393–400. Addison-Wesley.

11(4):29–41. 141 [38] Kenneth W. 88 . 16(1):22–29. and Robert Gruber. pages 281–288. Cooper. James Broberg. Church and Patrick Hanks. Andrew Fikes. Athens. 2009. Computer Systems. 136 [34] Fay Chang. 1990. Yi-An Lin. Indianapolis. Long. 2010. In Advances in Neural Information Processing Systems 19 (NIPS 2006). DOI: 10.2009.2008. Cormen. DOI: 10. Computing in Science and Engineering. Indiana.3115/981863. Graph twiddling in a MapReduce world. Beijing. Santa Cruz. Sang Kyun Kim. British Columbia. Sanjay Ghemawat. MIT Press. Stott Parker.1016/j. 4(4):405–436. Future Generation Computer Systems. and D.12. China.120 103 [40] Brian F. In Proceedings of the 7th Symposium on Operating System Design and Implementation (OSDI 2006). 2009. E. 145 [35] Stanley F. and Ronald L. 2009.1247602 146 [37] Cheng-Tao Chu. Map-Reduce for machine learning on multicore.001 6 [32] Luis-Felipe Cabrera and Darrell D. 133 [36] Hung chih Yang. Philipp Koehn. Canada. 1991. pages 1–28. Greece. Wallach. Swift: Using distributed disk striping to provide high I/O data rates. Christof Monz. Washington.future. Cloud computing and emerging IT platforms: Vision. An empirical study of smoothing techniques for language modeling. Tushar Chandra. Hsieh. DOI: 10. mutual information. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. Vancouver. and lexicography. DOI: 10. Raghu Ramakrishnan. pages 1029–1040.981904 5. Srikumar Venugopal. and reality for delivering computing as the 5th utility. In Proceedings of the Fourth Workshop on Statistical Machine Translation (StatMT ’09). and Russell Sears. Chen and Joshua Goodman. Charles E. 25(6):599–616. Benchmarking cloud serving systems with YCSB. CLOSING REMARKS [31] Rajkumar Buyya. Wilson C. California. Bigtable: A distributed storage system for structured data.152 7. YuanYuan Yu. 1996. pages 205–218. 2007. Map-Reduce-Merge: Simpliﬁed relational data processing on large clusters. Massachusetts. 2006. 29 [33] Chris Callison-Burch. 
and Ivona Brandic. Leiserson. Gary Bradski.1145/1247480. Seattle. Findings of the 2009 workshop on statistical machine translation. Michael Burrows. and Kunle Olukotun. Word association norms. Computational Linguistics. and Josh Schroeder. Chee Shin Yeo. Deborah A. Ali Dasdan. 2006. Jeffrey Dean. Introduction to Algorithms. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL 1996). In Proceedings of the First ACM Symposium on Cloud Computing (ACM SOCC 2010). Adam Silberstein. 145 [41] Thomas H. 1990. pages 310–318. Erwin Tam. 24. Rivest. hype. Andrew Ng. Ruey-Lung Hsiao. 48 [39] Jonathan Cohen.1109/MCSE. Cambridge.

Madan Jampani. Katz. 39(1):1–38. Julian Kupiec. Jan Pedersen. In Proceedings of the Third Conference on Applied Natural Language Processing. 2010.1145/1327452. MapReduce: Simpliﬁed data processing on large clusters.3115/974499. Search Engines: Information Retrieval in Practice. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP 2007). DOI: 10. 2004. Implementation techniques for main memory database systems. 24. Washington. ACM SIGMOD Record. Reading. Avinash Lakshman. DeWitt. DeWitt and Jim Gray. Multi-domain learning by conﬁdence-weighted parameter combination. 2008. and Werner Vogels. pages 133–140. Rubin. 14(2):1–8. Parallel database systems: The future of high performance database systems. Swami Sivasubramanian. 108 [50] David J.1327492 2 [47] Jeffrey Dean and Sanjay Ghemawat. California. MapReduce: Simpliﬁed data processing on large clusters. 1992. Peter Vosshall. 79:123–149. 2009. LogP: Towards a realistic model of parallel computation. David Patterson.129894 12 [51] David J. Communications of the ACM. Bruce Croft. pages 137–150.1145/173284.1007/s10994-009-5148-0 144 . Randy H.155333 15 [44] Doug Cutting. 2007. Alex Kulesza. and Penelope Sibun.3.1145/1629175. Klaus Erik Schauser. Deniz Hastorun. Ramesh Subramonian.1145/971697. and Thorsten von Eicken. Gunavardhan Kakulapati. Trento. Italy. MapReduce: A ﬂexible data processing tool. Stevenson. 25 [46] Jeffrey Dean and Sanjay Ghemawat. 145 [49] Arthur P. Nan M. and David Wood. Eunice Santos. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004). 35(6):85–98. Maximum likelihood from incomplete data via the EM algorithm. 51(1):107–113. Michael R. ACM SIGPLAN Notices. 1992.974523 114 [45] Jeffrey Dean and Sanjay Ghemawat. San Francisco. Alex Pilchin. and Koby Crammer. 2010. Massachusetts. 83 [43] David Culler. MAPREDUCE AND BEYOND 153 [42] W. DOI: 10.1145/129888. Series B (Methodological). Richard Karp. Dempster. and Trevor Strohman. 
Dynamo: Amazon’s highly available key-value store. Communications of the ACM. 1984. Frank Olken. 53(1):72–77.1629198 59 [48] Giuseppe DeCandia. Donald Meztler. DOI: 10. DOI: 10. Machine Learning. 1. Leonard D. 1993. Shapiro. 28(7):1–12. Stonebraker. pages 205–220. Addison-Wesley. Journal of the Royal Statistical Society. A practical part-of-speech tagger.602261 63 [52] Mark Dredze.7. Abhijit Sahay. and Donald B. Communications of the ACM. Laird. DOI: 10. DOI: 10. DOI: 10. 1977.

33(2):51–59. Cambridge. Alex Mont. Oxford. Addison-Wesley. Chiba.122653799 85 [60] Ananth Grama.1145/564376. 2003. The strength of weak ties. Michele Banko. Web spam taxonomy. 1983. 1973. Proceedings of the National Academy of Science. Tampere. Reading. The Google File System. The American Journal of Sociology. Web question answering: Is more always better? In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002). Sociological Theory. 135.2307/202051 92 [63] Zoltán Gyöngyi and Hector Garcia-Molina. In Proceedings of the Third Workshop on Statistical Machine Translation at ACL 2008. 47. Cambridge University Press. Japan. 103 [61] Mark S. Howard Gobioff. Community structure in social and biological networks.1145/564585. Aaron Cordova. 48 [56] Qin Gao and Stephan Vogel. available. DOI: 10. Anshul Gupta. A synopsis of linguistic theory 1930–55. 93:37–46. Jimmy Lin. 2008. 136 [55] John R.1073/pnas. 1957. pages 39–47. Bolton Landing. Finland. and cheap: Construction of statistical machine translation models with MapReduce. 86 . In Studies in Linguistic Analysis. Newman. Fast. DOI: 10. pages 291– 298. and Vipin Kumar. pages 29–43. pages 1–32. 2002. George Karypis. Granovetter. Columbus. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003). The strength of weak ties: A network theory revisited. Special Volume of the Philological Society. easy. and Classiﬁcation Structures in Oceania. J. 99(12):7821–7826. 2005. 78(6):1360– 1380. The Prague Bulletin of Mathematical Linguistics.945450 29 [58] Seth Gilbert and Nancy Lynch. Brewer’s Conjecture and the feasibility of consistent. Granovetter. DOI: 10. Kinship. DOI: 10. and Andrew Ng. Eric Brill. ACM SIGACT News. and Jimmy Lin. 1996. 2010. DOI: 10.1086/225469 92 [62] Mark S. partition-tolerant web services. 14. 96 [64] Per Hage and Frank Harary.154 7. 2003. New York.564428 4 [54] Chris Dyer.1145/945445. DOI: 10. Firth. 
2002. England. pages 199–207. DOI: 10. 1:201–233. Massachusetts. Training phrase-based machine translation models on the cloud: Open source machine translation toolkit Chaski. Blackwell. Ohio. CLOSING REMARKS [53] Susan Dumais. 2002. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005). and Shun-Tak Leung. Introduction to Parallel Computing.564601 32 [59] Michelle Girvan and Mark E. Island Networks: Communication.2478/v10108-010-0004-8 138 [57] Sanjay Ghemawat.

Mahadev Satyanarayanan. Scale and performance in a distributed ﬁle system. Redmond. 2009.2009.35059 29 [75] Michael Isard. 2009. Toronto. 1988. Redmond. Wiley. Lisbon. Ontario. DOI: 10.36 5 [66] James Hamilton. On designing and deploying Internet-scale services. and Kristin Tolle. DOI: 10. Raﬁul Hassan and Baikunth Nath. Andrew Birrell. Portugal. 24(2):8–12. pages 59–72. 59. Communications of the ACM. 7(6). Yuan Yu. and Kristin Tolle. The Fourth Paradigm: Data-Intensive Scientiﬁc Discovery. 2007. The unreasonable effectiveness of data. Washington. 2009. 3 [73] Tony Hey. Michael Kazar.1145/1563821. Cooperative Expendable Micro-Slice Servers (CEMS): Low cost. 2009. In Proceedings of the 5th International Conference on Intelligent Systems Design and Applications (ISDA ’05). 2009. Mihai Budiu. In Proceedings of the Fourth Biennial Conference on Innovative Data Systems Research (CIDR 2009). The pathologies of big data. Peter Norvig. low power servers for Internet-scale services. ACM Transactions on Computer Systems. pages 192–196. Sherri Menees. 9. Canada. Wroclaw. 2009. Govindaraju. Poland.1109/MIS. The Fourth Paradigm: Data-Intensive Scientiﬁc Discovery. Dryad: Distributed data-parallel programs from sequential building blocks. Sebastopol.85 114 [71] Bingsheng He. MAPREDUCE AND BEYOND 155 [65] Alon Halevy. Mars: A MapReduce framework on graphics processors. DOI: 10.7. 2005. DOI: 10. Harris.3. Washington. Qiong Luo.1145/35037. Wenbin Fang. Information platforms and the rise of the data scientist. Dallas. California. editors. 48 [70] Md. and Dennis Fetterly. and Michael West. In Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (EuroSys 2007). Robert Sidebotham.1145/1454115. Stewart Tansley. Mathematical Structures of Language. Stewart Tansley.2005.1273005 145 [76] Adam Jacobs.1109/ISDA. and Tuyong Wang. 2008. editors. Microsoft Research. 146 [69] Zelig S. 10 [67] James Hamilton. 6. pages 233–244. Naga K. 6(1):51–81. 
and Fernando Pereira. In Proceedings of the 21st Large Installation System Administration Conference (LISA ’07). Stewart Tansley. In Toby Segaran and Jeff Hammerbacher. 3 [74] John Howard. Beautiful Data. Stock market forecasting using hidden Markov models: A new approach. Jim Gray on eScience: A transformed scientiﬁc method. New York. Asilomar.1145/1272998. California. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT 2008). Texas.1454152 20 [72] Tony Hey. In Tony Hey. 2007. DOI: 10. Microsoft Research. O’Reilly. 1968. and Kristin Tolle. pages 73–84. David Nichols. pages 260–269. DOI: 10.1563874 11 . ACM Queue. 10 [68] Jeff Hammerbacher.

Technical Report CMU-ML-08-117. Upper Saddle River. Journal of the ACM. Speech and Language Processing. 46(5):604–632. 120 [79] Daniel Jurafsky and James H. Miami. New Jersey.1073462 131 [87] John D. Cambridge.3115/1073445. Statistical methods for speech recognition.1145/1352135. Christos Faloutsos. 138 . An Introduction to Parallel Algorithms. Addison-Wesley.2009. HADI: Fast diameter estimation and mining in massive graphs with Hadoop. pages 116–120. Lafferty. Portland. Franz J. Cambridge. A model of computation for MapReduce. 2001. Statistical Machine Translation. 130. Massachusetts. pages 282–289. Conditional random ﬁelds: Probabilistic models for segmenting and labeling sequence data. Reading. Charalampos Tsourakakis. PEGASUS: A peta-scale graph mining system—implementation and observations. Authoritative sources in a hyperlinked environment. Cluster computing for Web-scale data processing. and Christophe Bisciglia. 2009. and Daniel Marcu. Siddharth Suri. San Francisco. 103 [81] U Kang. pages 48–54. 2003. 14. 1999. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML ’01). 103 [78] Frederick Jelinek. Cambridge University Press. California. 1992. Texas. 2008. Edmonton. 2010. Ana Paula Appel. Andrew McCallum. Oregon. pages 229–238. 15 [83] Aaron Kimball. In Proceedings of the 39th ACM Technical Symposium on Computer Science Education (SIGCSE 2008). DOI: 10. Carnegie Mellon University. School of Computer Science. and Fernando Pereira. Alberta. Sierra Michels-Slettvet. and Jure Leskovec. 2010. Canada. Kleinberg.1109/ICDM. 5 [80] U Kang. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining (ICDM 2009). Och. and Christos Faloutsos. 95 [85] Philipp Koehn. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2010).1145/324133. DOI: 10. Charalampos E. 112. 
In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2003).1352177 71 [84] Jon M. DOI: 10.156 7. Tsourakakis. 1997. CLOSING REMARKS [77] Joseph JaJa. Statistical phrase-based translation. 133 [86] Philipp Koehn. 114. 2009. England. 2008. Martin. Floria. Austin. Pearson. MIT Press.324140 65. Massachusetts. and Sergei Vassilvitskii.14 103 [82] Howard Karloff. DOI: 10.

DOI: 10.1145/1583991. Triple-parity RAID and beyond.1145/382979. Bik.1007/BF01589116 139 [97] Adam Lopez. Liu. 63 [96] Dong C. 145 . [92] Jimmy Lin. 47. Maryland. Honolulu. Scalable language processing algorithms for the masses: A case study in computing word co-occurrence matrices with MapReduce. highthroughput access to static global resources within the Hadoop framework. Aart J. Low-latency. Dong C. Stream-based randomised language models for SMT.1584010 86. pages 419–428. Columbus. 2009. page 6. Exploring large-data issues in the curriculum: A case study with MapReduce. DOI: 10. Jorge Nocedal. 71 [94] Jimmy Lin. 51 [95] Jimmy Lin. MAPREDUCE AND BEYOND 157 [88] Ronny Lempel and Shlomo Moran. Dehnert. Liu. 2008. 2009. 2010. 27(2):1–55. ACM Transactions on Information Systems. DOI: 10. Stream-based translation models for statistical machine translation. 2008. 7(11). In Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010). 45(3):503–528. Singapore.1145/1661785. Matthew H. pages 54–61. ACM Computing Surveys.7. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008). Shravya Konda. C. An exploration of the principles underlying redundancy-based factoid question answering. James C. 2009. Austern. California. and Jorge Nocedal. 2001. Ilan Horn. University of Maryland.1145/1380584. SALSA: The Stochastic Approach for LinkStructure Analysis. Technical Report HCIL-2009-01. DOI: 10. pages 756–764. 145 [91] Adam Leventhal.1229180 4 [93] Jimmy Lin. Calgary. In Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics (TeachCL-08) at ACL 2008. 95 [89] Abby Levenberg. Canada. Mathematical Programming B. and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. Ohio. DOI: 10. 145 [90] Abby Levenberg and Miles Osborne. and Samantha Mahindrakar. and Miles Osborne. 40(3):1–49.383041 65. 
On the limited memory BFGS method for large scale optimization. Anand Bahety. 19(2):131–160. DOI: 10.3.1145/1229179. January 2009. 2007. Hawaii. College Park. Chris Callison-Burch.1380586 130 [98] Grzegorz Malewicz. 2008. Statistical machine translation.1670144 2 ACM Queue. In Proceedings of the 28th ACM Symposium on Principles of Distributed Computing (PODC 2009). Alberta. Los Angeles. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Naty Leiser. 1989. ACM Transactions on Information Systems.

Jasmine Novak. 1999. Scalable programming models for massively multicore processors.12. Miller. The impact of next-generation sequencing technology on genetics. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002). pages 214–221. and Justin Zobel. 2001. Austern.1016/j. Aart J. 145 [105] Marshall K. 2008. McCool. England. Schwartz. Naty Leiser.1145/312624. 83 [102] Christopher D.158 7. Cambridge University Press. 2002.2007. Building enriched document representations using aggregated anchor text. Washington. Mardis. A hidden Markov model information retrieval system. 2008. Improving memory hierarchy performance for irregular applications using data and computation reorderings. Cambridge. Ilan Horn. 42. pages 219–226. 88 [108] David R. Hang Cui. DOI: 10. Manning.1582723 86. CLOSING REMARKS [99] Grzegorz Malewicz. 2009. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1999). 96(5):816–831.1145/1582716. Berkeley. Indiana. An Introduction to Information Retrieval. Prabhakar Raghavan. McKusick and Sean Quinlan. ACM Queue.1594206 32 [106] John Mellor-Crummey. A comparison of algorithms for maximum entropy parameter estimation. Dehnert. William Webber. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009). 7(7). DOI: 10. 2010.1148232 82 . MIT Press. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. and Hinrich Schütze. 2008.1109/JPROC. Load balancing for term-distributed parallel retrieval.3115/1118853. Tim Leek. James C. Cambridge. Indianapolis. and Richard M. and Srihari Reddy.tig. Taiwan. DOI: 10. DOI: 10. DOI: 10. Foundations of Statistical Natural Language Processing. and Ken Kennedy. GFS: Evolution on fast-forward.312680 114 [109] Alistair Moffat. pages 348–355. Trends in Genetics. Matthew H. 29(3):217–247. 1999. Manning and Hinrich Schütze. 
145 [100] Robert Malouf.1118871 140 [101] Christopher D.917731 13. Taipei. Pregel: A system for large-scale graph processing. Bik. 24(3):133–141.1023/A:1011119519789 101 [107] Donald Metzler. C. H. pages 49–55. DOI: 10. David Whalley.1145/1571941.1145/1148170. and Grzegorz Czajkowski. DOI: 10.007 2 [104] Michael D. International Journal of Parallel Programming. California.1571981 68. 5. Seattle.1145/1594204. DOI: 10. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006). Massachusetts. Proceedings of the IEEE.2008. 133 [103] Elaine R. 2009. 2006. DOI: 10.

Washington. 2008.1095418 13 [116] Sean Owen and Robin Anil. DOI: 10. Connecticut. and Terry Winograd. Canada. John Lafferty. In Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid. 52(1):105. 138 [111] Daniel Nurmi. DOI: 10.. 141 [117] Lawrence Page. 1999. Patterson. 2009.1376726 59. The data center is the computer.. 146 [115] Kunle Olukotun and Lance Hammond. ACM Queue. 2008. and Andrew McCallum. Rich Wolski. Sunil Soman. and Andrew Tomkins.1145/1376616. DOI: 10. Daniel J. Och and Hermann Ney. Samuel Madden. Chris Grzegorczyk. 29(1):19–51. 95. Sergey Brin. 2009.1162/089120103321337421 135 [113] Christopher Olston and Marc Najork. Using maximum entropy for text classiﬁcation. In Proceedings of the IJCAI-99 Workshop on Machine Learning for Information Filtering. Stanford University. 2010. and Dmitrii Zagorodnov. Lamia Youseff. 2005.2009. Mahout in Action. Benjamin Reed. Pig Latin: A not-so-foreign language for data processing. Web crawling. Ravi Kumar. Stanford Digital Library Working Paper SIDL-WP1999-0120.1561/1500000017 67 [114] Christopher Olston. A systematic comparison of various statistical alignment models.7. 1999.1109/CCGRID. DOI: 10. The future of microprocessors. 2(1–2):1–135. In Proceedings of the 11th Annual Conference of the North American .3. Manning Publications Co. Abadi. Vancouver. Communications of the ACM. DOI: 10. pages 165–178.1145/1095408. 2008. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data.1561/1500000011 3 [119] David A. 2010. DOI: 10.C. Alexander Rasin. British Columbia. The PageRank citation ranking: Bringing order to the Web. Sweden. David J. Providence. In Proceedings of the 35th ACM SIGMOD International Conference on Management of Data. Miles Osborne. Erik Paulson.1559865 59 [121] Sasa Petrovic. Streaming ﬁrst story detection with application to Twitter. and Michael Stonebraker. 100 [118] Bo Pang and Lillian Lee. Rajeev Motwani. Graziano Obertelli. DeWitt. 
Rhode Island. Greenwich. 65. D. A comparison of approaches to large-scale data analysis. 4(3):175–246. DOI: 10. 2003. and Victor Lavrenko. Computational Linguistics. Utkarsh Srivastava.93 7 [112] Franz J. 14 [120] Andrew Pavlo. pages 124–131. Foundations and Trends in Information Retrieval. The Eucalyptus open-source cloud-computing system. pages 61–67. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval.1145/1559845. 3(7):27– 34. pages 1099–1110. MAPREDUCE AND BEYOND 159 [110] Kamal Nigam. Stockholm.

146 [123] Eduardo Pinheiro. IBM Systems Journal. Rappa. DOI: 10. PhD thesis. and Christos Kozyrakis. DOI: 10. pages 205–218. pages 267–296. 2007. Mustafa Raﬁque.Washington. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’09). Ross. A tutorial on hidden Markov models and selected applications in speech recognition. Seattle. Robert Griesemer. Wolf-Dietrich Weber. Stochastic processes. Supporting MapReduce on large-scale asymmetric multi-core clusters. In Readings in Speech Recognition.346181 20 [128] Delip Rao and David Yarowsky. New York. Monterey. 2010. Scientiﬁc Programming Journal. 1990. 34(1):32–42. DOI: 10. Wiley. High Performance Computing for DNA Sequence Alignment and Assembly. 2009. In Proceedings of the First USENIX Conference on File and Storage Technologies. Failure trends in a large disk drive population.1109/HPCA. In Proceedings of the ACL/IJCNLP 2009 Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-4).1555384 25 [132] Michael Schatz. Arun Penmetsa. The utility business model and the future of computing services. Morgan Kaufmann Publishers. California.1147/sj. Benjamin Rose.431.18626 115. ACM Operating Systems Review. Davison. 114 [131] Thomas Sandholm and Kevin Lai. 145 [122] Rob Pike. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 2007). 2009. Arizona. Phoenix. Web page classiﬁcation: Features and algorithms.1145/1555349. DOI: 10. 2010. 103 [129] Michael A. 2008. 2004. San Jose. Ramanan Raghuraman. Rabiner. ACM Computing Surveys.1531800 20 [127] Colby Ranger. California. College Park. 41(2). 13(4):277–298. CLOSING REMARKS Chapter of the Association for Computational Linguistics (NAACL HLT 2010). Interpreting the data: Parallel analysis with Sawzall.0032 6 [130] Sheldon M. 1996. and Dimitrios S.1459357 82 [125] Lawrence R. 10. DOI: 10. and Luiz André Barroso. 43(2):25–34. 
Evaluating MapReduce for multi-core and multiprocessor systems. Butt. DOI: 10. Ali R. Sean Dorward. 103 [133] Frank Schmuck and Roger Haskin. In Proceedings of the 13th International Symposium on High-Performance Computer Architecture (HPCA 2007).1145/1459352. GPFS: A shared-disk ﬁle system for large computing clusters. 29 . 120 [126] M.1145/1531793. pages 299–310.160 7.1109/5. 26 [124] Xiaoguang Qi and Brian D. 2005. MapReduce optimization using regulated dynamic prioritization. Nikolopoulos. 2009. 2009. Gary Bradski. University of Maryland. Singapore. Ranking and semi-supervised classiﬁcation on large scale graphs using Map-Reduce. Los Angeles. and Sean Quinlan. San Francisco. California. pages 231–244. 2002. California.2007.

Portland. Log-linear models. Automatic word sense discrimination. Washington. and Alexander Rasin. Orlando.3. Named Entities: Recognition. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’09). Edmonton. 26 [136] Hinrich Schütze. 24(1):97– 123. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2003). Florida. 53(1):64–71. October 2003. Stewart Tansley. Sam Madden. Beyond the tsunami: Developing the infrastructure to deal with life sciences data.cs.1093/bioinformatics/btg1080 114 [144] Michael Stonebraker. Amsterdam.edu/˜nasmith/papers/smith. 2009. 3 [139] Kristie Seymore. Canada. Classiﬁcation and Use. In Tony Hey. Alberta.1073473 140 [141] Noah Smith. 2 [143] Mario Stanke and Stephan Waack.1145/67544. http://www. 19 Suppl 2:ii215–225. Oregon. 140 [142] Christopher Southan and Graham Cameron. and Kristin Tolle. Shallow parsing with conditional random ﬁelds. Pedersen. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction. Erik Paulson.1145/1629175.1629197 59 . Daniel Abadi. editors.1555372 10. Gene prediction with a hidden Markov model and a new intron submodel. pages 110–121. and Wolf-Dietrich Weber. Andrew Mccallum. and Ronald Rosenfeld. Learning hidden Markov model structure for information extraction. DeWitt. DeWitt. 114 [140] Fei Sha and Fernando Pereira. 1999. pages 193–204. Bioinformatics. 48 [137] Hinrich Schütze and Jan O. Redmond.cmu. 1998. 2003. MAPREDUCE AND BEYOND 161 [134] Donovan A. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment.66937 60 [135] Bianca Schroeder.1016/S0306-4573(96)00068-4 48 [138] Satoshi Sekine and Elisabete Ranchhod. pages 37–42. 2009. Schneider and David J. 33(3):307–318. 2004. 1998. Microsoft Research.1145/1555349. DOI: 10. 


Authors' Biographies

JIMMY LIN

Jimmy Lin is an Associate Professor in the iSchool (College of Information Studies) at the University of Maryland, College Park. He directs the recently-formed Cloud Computing Center, an interdisciplinary group that explores the many aspects of cloud computing as it impacts technology, people, and society. Lin's research lies at the intersection of natural language processing and information retrieval, with a recent emphasis on scalable algorithms and large-data processing. He received his Ph.D. from MIT in Electrical Engineering and Computer Science in 2004.

CHRIS DYER

Chris Dyer is graduating with a Ph.D. in Linguistics from the University of Maryland, College Park in June, 2010 and will be joining the Language Technologies Institute at Carnegie Mellon University as a postdoctoral researcher. His research interests include statistical machine translation and machine learning, and he has served as a reviewer for numerous conferences and journals in the areas of natural language processing and computational linguistics. He first became acquainted with MapReduce in 2007 using Hadoop, version 0.13.0, and gained further experience with MapReduce during an internship with Google Research in 2008.
