
Why is Google a verb?

Matthew Lawler

13 Feb 2018 Matthew Lawler lawlermj1@gmail.com 1


What is in this presentation for you?
• How does Google process petabytes of
data daily?
• Can you learn from them?
• To interest you in using this tool for your work.

from 'Does IT matter?' by Nicholas Carr


Presentation Purpose
• To outline
– the secret sauce behind Google.
• This is aimed at a technical audience.
• This is an article summary.
• Contents
– Why is this important?
– What is MapReduce?
– Why use MapReduce?
– MapReduce in detail
– Lisp examples
– MapReduce Flowchart
– Extras
– Conclusions
• Questions as you like

2009 Why is Google a verb? MapReduce! 3


Why is Google a Verb?
• Google exploits an effect called "The
Wisdom of Crowds", which holds that large
groups possess a collective wisdom. How?
– In 2004, two Google developers published how
they were able to completely rewrite the
indexing system for the Google web search
service, across 20 TB of data.
– This is explained in an article called
"MapReduce: Simplified Data Processing on
Large Clusters" by Jeffrey Dean and Sanjay
Ghemawat.
– This presentation summarises that article.



What is MapReduce?
• MapReduce is a programming model and an
associated implementation for processing
and generating large data sets.
– Users specify
• a map function that processes a key/value pair to
generate a set of intermediate key/value pairs, and
• a reduce function that merges all intermediate values
associated with the same intermediate key.
– Many real-world tasks are expressible in this
model.
• The MapReduce abstraction is inspired by
the map and reduce primitives present in
Lisp, which is a functional language.
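
As an illustration of how a real-world task fits the model, here is a minimal single-process sketch of an inverted index, one of the tasks the article lists as expressible in MapReduce. The driver `run_mapreduce` and the names `index_map` and `index_reduce` are hypothetical helpers invented for this sketch; they are not Google's API, and the sample documents are made up.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by intermediate key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        # The map function turns one input pair into intermediate pairs.
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # The reduce function merges all values that share an intermediate key.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

def index_map(doc_id, text):
    # Emit one (word, doc_id) pair per distinct word in the document.
    for word in set(text.lower().split()):
        yield word, doc_id

def index_reduce(word, doc_ids):
    # Merge the document ids seen for this word into a sorted list.
    return sorted(set(doc_ids))

docs = [("d1", "google is a verb"), ("d2", "a verb is a word")]
index = run_mapreduce(docs, index_map, index_reduce)
print(index["verb"])    # ['d1', 'd2']
print(index["google"])  # ['d1']
```

Only the two small functions are specific to the task; the grouping step in between is generic, which is the part a MapReduce library supplies.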



Why use MapReduce?
• Functional style makes parallelization possible.
• When fault-tolerance, data distribution and load balancing
are added in, Google is able to achieve massive scalability.
• Programs written in this style are automatically parallelized
and executed on a large cluster of commodity machines.
• The run-time system takes care of partitioning, scheduling,
handling machine failures, and inter-machine
communication.
• This allows programmers without any experience with
parallel and distributed systems to easily utilize the
resources of a large distributed system.
• Once they understand it, programmers find the system easy to
use.
• Result => Hundreds of MapReduce programs have been
implemented and upwards of one thousand MapReduce jobs
are executed on Google's clusters every day.
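
The first point above can be sketched in a few lines: because a pure function depends only on its argument, the calls can run in any order, or at the same time, without changing the result. This is only an illustration; the thread pool stands in for a cluster of machines, and `square` is an arbitrary example function.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Pure function: the result depends only on the argument,
    # so calls can safely run in any order or concurrently.
    return x * x

numbers = list(range(10))

# Sequential execution.
sequential = [square(n) for n in numbers]

# Concurrent execution; Executor.map returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, numbers))

print(sequential == parallel)  # True
```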



MapReduce in detail
• The MapReduce Model takes a set of input key/value pairs, and
produces a set of output key/value pairs.
• The user of the MapReduce library expresses the computation as
two user written functions: Map and Reduce.
• The Map function takes an input pair and produces intermediate
key/value pairs.
– (map (key1, value1)) => list of (key2, value2)
• The Reduce function accepts an intermediate key and the set of
values for that key, and merges these values to form a possibly
smaller set of values.
• Typically, zero or one output value is produced per Reduce call.
– (reduce (key2, list of value2)) => (value3)
• E.g., how to count each word across a collection of documents:
– The map function emits each word plus a count of 1.
– The reduce function sums together all counts for each word.
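
The word count example can be sketched as follows. This is a single-process illustration under stated assumptions, not the distributed implementation: the function names and the sample documents are invented for the sketch, and the grouping loop in `word_count` stands in for the work the MapReduce library does between the two phases.

```python
from collections import defaultdict

def word_count_map(doc_name, contents):
    # Map: emit each word plus a count of 1.
    for word in contents.split():
        yield word, 1

def word_count_reduce(word, counts):
    # Reduce: sum all counts emitted for this word.
    return sum(counts)

def word_count(documents):
    # Group intermediate values by key (done by the library in a real system).
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word, count in word_count_map(name, contents):
            intermediate[word].append(count)
    return {word: word_count_reduce(word, c) for word, c in intermediate.items()}

counts = word_count({"a.txt": "to be or not to be", "b.txt": "to google"})
print(counts["to"])  # 3
print(counts["be"])  # 2
```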



MapReduce Examples
• Add one to each list element:
– (mapcar #'1+ '(1 2 3)) => (2 3 4)
• Determine the length of each sublist:
– (mapcar #'length '((a) (a b) (a b c))) => (1 2 3)
• Sum all list elements:
– (reduce #'+ '(1 2 3 4)) => 10
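
For readers who do not use Lisp, the same three primitives can be sketched in Python with the built-in `map` and `functools.reduce`. This is a rough equivalent offered for comparison only.

```python
from functools import reduce
from operator import add

# Add one to each list element, like (mapcar #'1+ '(1 2 3)).
plus_one = list(map(lambda x: x + 1, [1, 2, 3]))

# Length of each sublist, like (mapcar #'length '((a) (a b) (a b c))).
lengths = list(map(len, [["a"], ["a", "b"], ["a", "b", "c"]]))

# Sum all list elements, like (reduce #'+ '(1 2 3 4)).
total = reduce(add, [1, 2, 3, 4])

print(plus_one)  # [2, 3, 4]
print(lengths)   # [1, 2, 3]
print(total)     # 10
```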



MapReduce Flowchart
[Figure: MapReduce (System Architecture). The User Program forks the Master (1), which assigns Map and Reduce work to Workers 1 to 4 (2). Map Phase: workers read input file splits 1 to 3 (3) and write intermediate files locally on Machines 1 to 3 (4). Reduce Phase: workers read the intermediate files remotely (7) and write output files 1 and 2 (8).]

Comment (System Architect, Sun Sep 27, 2009 21:02)
(1) Master is invoked by User Program.
(2) Master assigns M Map Workers and R Reduce Workers.
(3) Each Map Worker reads pre-split input data.
(4) Each Map Worker writes local intermediate data.
(5) Each Map Worker notifies Master of completion.
(6) Reduce Worker checks for Map Worker output.
(7) Each Reduce Worker reads the smaller intermediate file remotely.
(8) Each Reduce Worker writes the summarised output to a small number of files.
(9) Each Reduce Worker notifies Master of completion.
(10) Master wakes User program with final results.
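
The data flow in steps (3) to (8) can be simulated in one process. This is a sketch under assumptions: Python lists stand in for files and machines, `run_job` and `partition` are hypothetical names, and the byte-sum partitioning rule is a simple stand-in chosen for repeatability, not the function Google uses. Master scheduling, notifications and failures are left out.

```python
from collections import defaultdict

def map_fn(_split_id, line):
    for word in line.split():
        yield word, 1

def reduce_fn(_word, counts):
    return sum(counts)

def partition(key, num_reducers):
    # Stable stand-in for a hash: the same key always goes to the same reducer.
    return sum(key.encode()) % num_reducers

def run_job(splits, num_reducers):
    # Steps (3)-(4): each map task reads one split and writes R local buckets.
    intermediate = []
    for split_id, split in enumerate(splits):
        buckets = [[] for _ in range(num_reducers)]
        for line in split:
            for key, value in map_fn(split_id, line):
                buckets[partition(key, num_reducers)].append((key, value))
        intermediate.append(buckets)
    # Steps (6)-(8): each reduce task reads its bucket from every map task,
    # groups the values by key, reduces them and writes one output "file".
    outputs = []
    for r in range(num_reducers):
        grouped = defaultdict(list)
        for buckets in intermediate:
            for key, value in buckets[r]:
                grouped[key].append(value)
        outputs.append({k: reduce_fn(k, v) for k, v in sorted(grouped.items())})
    return outputs

splits = [["a b a"], ["b c"], ["a c c"]]   # M = 3 input splits
outputs = run_job(splits, num_reducers=2)  # R = 2 output files
merged = {k: v for out in outputs for k, v in out.items()}
print(len(outputs))  # 2
print(merged["a"])   # 3
```

Each key lands in exactly one output, so the R outputs can simply be concatenated or fed to another job.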



Extras
• Performance
– Network bandwidth is a relatively scarce resource.
– All input data is stored on local disks in the Google cluster.
– All files are divided into 64 MB blocks.
• Certainty
– When the user-supplied map and reduce operators are
deterministic functions of their input values, the Google
MapReduce implementation produces the same output as
would have been produced by a non-faulting sequential
execution of the entire program.
– This is a natural outcome of using a functional approach.
• Other matters covered in the article, but not in this presentation:
– Worker Failure, Master Failure, Stragglers, Ordering
Guarantees, Combiner Function, Input and Output Types,
Skipping Bad Records, Local Execution for debugging, Status
Information, Counters
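
The certainty point can be illustrated with a small sketch: when the map function is deterministic, re-running a task after a crash yields the same pairs, so the final result matches a run with no faults. The crash is simulated, and `run_with_retry` is a hypothetical helper written for this sketch, not part of any real library.

```python
def map_task(split):
    # Deterministic map: the same split always yields the same pairs.
    return [(word, 1) for word in split.split()]

def run_with_retry(task, split, failures):
    """Re-run the task until it succeeds; `failures` simulated crashes come first."""
    attempt = 0
    while True:
        try:
            if attempt < failures:
                raise RuntimeError("simulated worker crash")
            return task(split)
        except RuntimeError:
            attempt += 1  # the task is simply scheduled again

def count(pairs):
    totals = {}
    for word, n in pairs:
        totals[word] = totals.get(word, 0) + n
    return totals

splits = ["x y x", "y z"]
# Run with no faults.
clean = count(p for s in splits for p in map_task(s))
# Run where the task for split i crashes i times before succeeding.
faulty = count(
    p for i, s in enumerate(splits) for p in run_with_retry(map_task, s, failures=i)
)
print(clean == faulty)  # True
```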



Conclusions
> The MapReduce model has been successful at Google
because:
1. The model is easy to use, even for programmers without
experience with parallel and distributed systems.
2. Many problems can use the MapReduce model/design pattern.
3. MapReduce scales to clusters of thousands of machines.
> Google have learned several things from this work.
1. Restricting the programming model makes it easy to parallelize
and distribute computations and to make them fault-tolerant.
2. Network bandwidth is a scarce resource.
3. Redundant execution can be used to reduce the impact of slow
machines, and to handle machine failures and data loss.
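
Redundant execution can be sketched as follows: the same task is started twice and whichever copy finishes first supplies the result. This is only an illustration under assumptions. Threads stand in for machines, `time.sleep` simulates a slow machine, and `run_with_backup` is a hypothetical helper; a real system would also abandon the slower copy, whereas this sketch lets it run to completion when the pool shuts down.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def reduce_task(values, delay):
    time.sleep(delay)  # simulated machine slowness
    return sum(values)

def run_with_backup(values, delays):
    # Start one copy of the task per entry in `delays`.
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(reduce_task, values, d) for d in delays]
        # Take the result of whichever copy completes first.
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()

# The first copy is a straggler; the backup copy finishes sooner.
result = run_with_backup([1, 2, 3, 4], delays=[0.3, 0.01])
print(result)  # 10
```

Because the task is deterministic, it does not matter which copy wins: both produce the same value.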

