
Big Data Parallel Programming

Part 1: Course Introduction

Slawomir Nowaczyk

Big Data Parallel Programming Course

Goals
Processing vast amounts of data is at the core of data mining, deep
learning and real-time autonomous decision making, all of which are in
turn at the core of modern artificial intelligence applications. Data
can reside more or less permanently in the cloud, accessed via
distributed file systems, and/or be streamed in real time from
multiple sensors at very high rates.

Access to data is done through well-engineered frameworks where both storage


and processing are distributed, so that operations can be done in parallel. This
course introduces you to this infrastructure, including parallel programming for
the implementation of these frameworks. This should enable you to judge how
to choose a framework for your applications, identify pros and cons, suggest
improvements and even implement specific adaptations that you may need.
Three Topics in the BDPP Course

• Topic 1: 4 lectures, 2 labs & a project
• Topic 2: 1 lecture & 1 lab
• Topic 3: 2 lectures & 2 labs



Big Data Parallel Programming - teachers

Sepideh Pashami
Slawomir Nowaczyk
Urban Bilstrup
Abdallah Alabdallah
Peyman Mashhadi
Mahmoud Rahat



Learning Outcomes



Teaching Format
Lectures
We will cover distributed file systems & processing concepts, parallel
programming paradigms, and specialised methods, frameworks and
tools. The lectures introduce the material covered in the literature.
Labs
Labs include hands-on exercises on data-intensive computing and
parallel programming, allowing you to develop skills and practical
abilities in the area. Labs are mandatory & graded.
Project
Final project on data-intensive computing (Spark part) gives you a
chance to demonstrate the knowledge acquired during the course in
the context of a realistic application. This is by far the largest part of
the course, and it determines the final grade.
Formalities

Examination
To pass the course you need to pass:
• the labs (2.5 credits, Pass/Fail) – work in pairs, reports in Jupyter.
• the project (5 credits, 5-4-3/Fail) – individual work; design and
implement a program, write a report & pass oral exam [which
includes a presentation of your project and answering questions
both about your solution as well as general course material]
Questions?
Big Data Parallel Programming
Part 2: Parallel Programming

Slawomir Nowaczyk



Parallel Programming

Why? The only way to scale up!
How? Divide and conquer!



Two Kinds of Parallelism
Data parallelism
The same task runs on
different data in parallel

Task parallelism
Different tasks run on
the same data in parallel

Hybrid & Pipeline


In practice, most actual
solutions lie somewhere
along this spectrum…
but it’s not balanced!
Two Very Different Approaches



Two Very Different Frameworks

General-purpose frameworks
The course includes modern techniques, methods and tools for
distributed storage (e.g., distributed and fault-tolerant key-value
tables including replication and coordination mechanisms) as well as
for distributed processing (e.g., frameworks such as MapReduce
and Spark). It covers both static and streamed massive data. As a
principle, all kinds of algorithms and data types are supported.

Specialised frameworks
In addition, the course includes concepts, methods and tools for
parallel programming for homogeneous clusters, in particular the CUDA
library for GPUs. As a principle, this kind of processing is highly
specialised and supports only a limited set of operations.
Parallel Programming is hard



Pipeline parallelism

Assumption
Output of one task is the input to the next
Benefits
All the stages can run in parallel
Limitations
Throughput is limited by the slowest (highest-latency) stage
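
As an illustration (not from the slides), here is a minimal sketch of a pipeline in Python, built from threads and queues; the stage functions and input lines are made up:

import threading
import queue

def stage(fn, inbox, outbox):
    # Each stage runs in its own thread: take an item, transform it,
    # and pass it downstream. All stages work on different items at once.
    while True:
        item = inbox.get()
        if item is None:          # sentinel: propagate shutdown downstream
            outbox.put(None)
            break
        outbox.put(fn(item))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(str.lower, q1, q2)).start()
threading.Thread(target=stage, args=(str.split, q2, q3)).start()

for line in ["Hello World", "Big Data Parallel Programming", None]:
    q1.put(line)

while (words := q3.get()) is not None:
    print(words)                  # e.g. ['hello', 'world']

If one stage is much slower than the others, items pile up in the queue in front of it, which is exactly the highest-latency limitation noted above.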
Programming the Pipeline



Programming the Pipeline
• Specialised compute-intensive highly parallel repetitive processing
o originally designed for graphics and physics calculations (in computer games)
o now mostly used for Machine Learning with Deep Artificial Neural Networks

• Heterogeneous computing model (the CUDA architecture)
o two interacting units: the host (the CPU) and the device (the GPU)

• Matrices / tensors
o a single class of problems
o common & important
o worth creating dedicated
specialised hardware
Spark
Why? – Process Big Data in a flexible way, with arbitrary code
How? – Build upon ideas from functional programming
The trick? – Forbid global state & automate worker management

CUDA
Why? – Make very repetitive computations on data run very fast
How? – Only allow simple math operations on tensors/matrices
The trick? – Use dedicated hardware, carefully designed for the task



Big Data Parallel Programming
Part 3: Parallel Programming Example

Slawomir Nowaczyk



Simple Example

Goals
• Introduce a little bit of Python programming
• Discuss the difference between having “small” or “big” data
• Design an algorithm for counting the number of word occurrences
 i.e., how many times each word is used in a given text



Counting Word Occurrences in a (small) File

An Algorithm
• Read all the text from the file, and split into words
• Count the number of occurrences of each unique word
• Write the output to a file or a screen
 each unique word & the number of times it occurs in the file



Simple Code

Reading text from a file


• The with control structure takes care of resource management
 it will open the file and make sure it gets closed correctly
• Method read() loads the content of the file into memory
• Method split() applied to a string returns a list of sub-strings
separated by whitespace and newlines
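
A minimal sketch of the code the slide describes (the file name is a placeholder):

# `with` opens the file and guarantees it is closed, even if an error occurs
with open("input.txt") as f:
    text = f.read()      # load the whole content of the file into memory

words = text.split()     # list of sub-strings, split on whitespace and newlines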



A Little on Data Structures
Dictionary in Python
A data structure that allows you to:
• create an empty dictionary
• insert a {key, value} pair
• efficiently(!) look up the value
associated with a key
• remove a {key, value} pair
• iterate through all the keys
It is usually implemented using hash
tables or search trees



Counting Words Algorithm
Python dict reference:
d = dict()
d[key] = value
d.get(key, if_not)
key in d
del d[key]
for key in d:

Algorithm:
• Create an empty dictionary
 keys will be words
 values will be counts
• Insert new words with count=1
• Update existing words by incrementing the value already in the dictionary

Method word.lower() turns all characters to lower case
(we consider the word "Word" to be the same word as "word")


The Complete Python Program
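
The slide shows the complete program; below is a minimal sketch consistent with the steps described above (the file name is a placeholder):

counts = dict()                        # empty dictionary: word -> count

with open("input.txt") as f:
    for word in f.read().split():
        word = word.lower()            # "Word" counts as the same word as "word"
        counts[word] = counts.get(word, 0) + 1

for word, count in counts.items():     # write the output to the screen
    print(word, count)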



Consider “Big Data”

Count the occurrences of words on the web


Based on an estimate by J. Leskovec, A. Rajaraman and J. Ullman (Stanford
University, mmds.org), "the web" is 20+ billion web pages at 20 KB each…
That's a total of 400+ TB of data…
Given that a single computer reads ≈30-35 MB/sec from HDD/SSD…
we would need approximately 4 months just to read the whole web…
and ≈200 hard drives to store this information while processing!



Counting Word Occurrences in a (BIG) File

Goals
• Present an algorithm that can be parallelised
• A little bit more Python programming
• Discuss the abstract structure of the new algorithm
• Consider other problems that exhibit similar structure
 i.e., that can be solved with the same class of algorithms
• Motivate the need for general-purpose distributed infrastructure



Counting Word Occurrences in a (BIG) File

Small File (a reminder)


• Read all the text from the file, and split into words
• For each word, increment the count in a dictionary

Divide-and-Conquer
• Divide file into parts, read text from each part, and split into words
• Count the number of occurrences of unique words in each part
• Merge all the partial results together (somehow)
• Write the output to a file or a screen



Data Structures Again
Lists rather than dictionaries
A dictionary data structure is quite
complicated internally, and not
easily parallelisable.
However, we can use lists instead
• they are much simpler to maintain
• support the same operations, e.g.,
add/remove elements & iteration
• efficient membership test if sorted
• lists can be merged in linear time
In particular, lists of key-value pairs.
Counting Words – Parallel Version
Input: [these, are, some, words, and, these, are, more, words]

Algorithm
• Pair each word read from the file with the number "1":
[('these', 1), ('are', 1), ('some', 1), ('words', 1), ('and', 1),
('these', 1), ('are', 1), ('more', 1), ('words', 1)]

• Sort the resulting list of pairs in word order:
[('and', 1), ('are', 1), ('are', 1), ('more', 1), ('some', 1),
('these', 1), ('these', 1), ('words', 1), ('words', 1)]

• Group all the "1"s for each word, i.e., pair each word with a list of "1"s:
[('and', [1]), ('are', [1, 1]), ('more', [1]), ('some', [1]),
('these', [1, 1]), ('words', [1, 1])]

• Sum up all the numbers in the list for each word:
[('and', 1), ('are', 2), ('more', 1), ('some', 1), ('these', 2), ('words', 2)]


Counting Words – Parallel Version

Each of these steps can be parallelised!


The Abstract Algorithm
Map – Reduce
The Map-Reduce framework consists of two steps… the first is called
“Map” and the second is called “Reduce” (after LISP functions).

Map
Apply the same function to all the elements in the input list.
In the example, it was "pair with 1" – that's a function. This can be
done in parallel, over chunks of the input of equal or different sizes…
Then, instead of one list, we will get several lists!
Reduce
The last part is called reducing: we reduce a list of values to a single
value by applying an accumulating function (here, plus). This can also be
done in parallel: we can take parts of the list and reduce each part separately.
Instead of one output file, we might get several output files!
Big Data Parallel Programming
Part 4: MapReduce Paradigm

Slawomir Nowaczyk



Counting Words – MapReduce in Python
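
The slide's code is not reproduced here; what follows is a minimal sketch of the four steps using Python built-ins, assuming the whole word list fits in memory:

from functools import reduce
from itertools import groupby
from operator import itemgetter

words = ["these", "are", "some", "words", "and", "these", "are", "more", "words"]

pairs = [(w, 1) for w in words]                # Map: pair each word with a "1"
pairs.sort(key=itemgetter(0))                  # Sort pairs in word order
grouped = [(w, [n for _, n in group])          # Group the "1"s for each word
           for w, group in groupby(pairs, key=itemgetter(0))]
counts = [(w, reduce(lambda a, b: a + b, ns))  # Reduce: sum the list per word
          for w, ns in grouped]

print(counts)
# [('and', 1), ('are', 2), ('more', 1), ('some', 1), ('these', 2), ('words', 2)]

The sort-and-group step in the middle is the one a MapReduce engine performs for you automatically, which is the subject of the next slide.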



The Secret/Automatic Step



The Goal of “Parallel Platforms/Frameworks”



What is a Distributed System?



Distributed System as a Middleware



MapReduce as Middleware

• “A simple and powerful interface that enables automatic parallelization


and distribution of large-scale computations, combined with an
implementation of this interface that achieves high performance on
large clusters of commodity PCs.”
• More simply, MapReduce is a parallel programming model
• Today we have a number of implementations following this paradigm
o originally developed internally at Google (initial white paper, 2004)
o first open source implementation was Apache Hadoop in 2005
o we will be working with Apache Spark, initially created in 2009
(and a top-level Apache project since 2014)
o but there is a lot more…



MapReduce Overview



MapReduce Programming Model



Programming Models – von Neumann model

• Execute a stream of instructions (machine code)


• These instructions can specify
o arithmetic operations
o data addresses
o next instruction to execute

• Complexity
o track billions of data locations and millions of instructions

• Manage with
o modular design
o high-level programming languages



Programming Models – Parallel Programming Models

• Message passing
o independent tasks encapsulating local data
o tasks interact by exchanging messages

• Shared memory
o tasks share a common address space
o tasks interact by reading and writing this space asynchronously

• Data parallelisation
o tasks execute a sequence of independent operations
o data usually evenly partitioned across tasks
o also referred to as “embarrassingly parallel”

• Challenge: avoiding unacceptable trade-offs between ease of use and efficiency


Programming Model – MapReduce

• Process data using a limited set of functions, map() and reduce()


• The map() function is called on every item in the input and emits a
series of intermediate key/value pairs
• All values associated with a given key are grouped together
• The reduce() function is called on every unique key, and its value list,
and emits a value that is added to the output
• Benefits:
o limited language expressivity allows for very efficient execution
o specialised runtime engines offer high scalability and performance



Success = Programming Model + Engine
• Simple control flow makes dependencies evident
• Engine can automate scheduling of tasks and optimization
o map & reduce for different keys is embarrassingly parallel
o pipeline between mappers & reducers is evident

• map() and reduce() are pure functions


o engine can re-run them to get the same answer
o in the case of failure, or to use idle resources toward faster completion

• No need to worry about typical concurrency problems


o data races, deadlocks, and so on since there is no shared state

• The execution engine itself is very complex


o but it only needs to be created once!
MapReduce: Programming Model

[Diagram] The input ("How now brown cow", "How does it work now", …) is
split across mappers (M), each emitting <word, 1> pairs; the MapReduce
framework groups the pairs by word (e.g., <How, 1 1>, <now, 1 1>), and
reducers (R) sum them to produce the output:
brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.


This is Automatic!
More Examples

Design algorithms to calculate…


• … the largest integer

• … the average of all the integers

• … the same set but with each integer appearing only once

(in all cases assume input is a large set of integers)
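
One possible set of answers, sketched below; the chunking and helper names are illustrative, not from the slides:

# Largest integer: each mapper emits its local max; reduce takes the max of maxes.
def map_max(chunk):
    return max(chunk)

# Average: local averages alone are NOT enough (chunks may differ in size),
# so each mapper emits a (sum, count) pair and the reducer combines them.
def map_avg(chunk):
    return (sum(chunk), len(chunk))

def reduce_avg(pairs):
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total / count

# Distinct set: map each integer to a key; grouping merges the duplicates.
def map_distinct(chunk):
    return set(chunk)

chunks = [[3, 1, 4, 1], [5, 9, 2, 6, 5]]
print(max(map_max(c) for c in chunks))                  # 9
print(reduce_avg([map_avg(c) for c in chunks]))         # 4.0
print(sorted(set().union(*map(map_distinct, chunks))))  # [1, 2, 3, 4, 5, 6, 9]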



MapReduce Benefits

• Greatly simplifies challenges of parallel programming


• Reduces synchronization complexity
• Automatically partitions data
• Provides failure transparency
• Handles load balancing
• Limited but practical
o approximately 1000 Google MapReduce jobs
were running every day within the first year (2004)



The MapReduce Engine

No Free Lunch???
The engine will automatically do the following:
• partition the input data
• schedule the program execution across a set of machines
(the mappers, the reducers, the sorters and the groupers)
• handle machine failures & restart computations
• manage inter-machine communication

Programmer’s Job
As a programmer, you only need to say what to do on each piece of
data – for the mapping part, map() – and what to do on each piece of
grouped value – for the reducing part, reduce()!



Scale vs Failure Tolerance

• (Source: Indranil Gupta, Lectures 2-3 distributed systems CS425)


o Facebook (GigaOm): 30K machines in 2009 → 60K in 2010 → 180K in 2012
o Microsoft (NYTimes, 2008): 150K, growth rate of 10K per month
o Yahoo! 2009: 100K machines, split into clusters of 4000
o eBay (2012): 50K machines
o HP (2012): 380K machines
o Google (2011, Data Center Knowledge): 900K machines (currently over 1M)

• At this size, failure is the norm


o assume a single machine fails, on average, once every 10 years
o with 120 servers, the mean time until the next machine fails is 1 month
o with 12,000 servers, the MTTF is 7.2 hours



Task Execution

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org



Chunk Servers
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Data Split
Data is stored in “chunk servers” (distributed and redundant)
And the chunk servers can also be doing computations…
but then we need to organise computations so that they can
work on pieces of data, for example as Map-Reduce algorithms!



Overall Execution of Map-Reduce



More MapReduce Examples



MapReduce Use Case 1: Map Only



MapReduce Use Case 2: Filtering and Accumulation
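
The slide's diagram is not reproduced; as an illustration, the pattern could look like this sketch, where map() filters out non-matching records and reduce() accumulates counts per key (the log format is made up):

from collections import defaultdict

def map_fn(line):
    # Filtering: emit (error_code, 1) only for ERROR lines, drop the rest
    level, code = line.split()[:2]
    if level == "ERROR":
        yield (code, 1)

def reduce_fn(code, ones):
    # Accumulation: sum up the occurrences of each error code
    return (code, sum(ones))

log = ["ERROR 500 ...", "INFO 200 ...", "ERROR 500 ...", "ERROR 404 ..."]
groups = defaultdict(list)            # stands in for the shuffle/group step
for line in log:
    for key, value in map_fn(line):
        groups[key].append(value)
print([reduce_fn(k, v) for k, v in sorted(groups.items())])
# [('404', 1), ('500', 2)]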



MapReduce Use Case 3: Database Join

A JOIN is a means for combining fields from two tables by


using values common to each. For example, for each
employee, find the department they work in…
JOIN predicate: EMPLOYEE.DepID = DEPARTMENT.DepID

Employee Table              Department Table
LastName    DepartmentID    DepartmentID    DepartmentName
Rafferty    31              31              Sales
Jones       33              33              Engineering
Steinberg   33              34              Clerical
Robinson    34              35              Marketing
Smith       34

JOIN RESULT
LastName    DepartmentName
Rafferty    Sales
Jones       Engineering
Steinberg   Engineering
…           …
MapReduce Use Case 3 – Database Join
• Problem is massive lookups
o given (URL, ID) and (URL, doc_content) pairs produce (URL, ID, doc_content)

• Input stream is both (URL, ID) and (URL, doc_content) lists


o in: (http://del.icio.us/, 0), (http://digg.com/, 1), …
o in: (http://del.icio.us/, <html0>), (http://digg.com/, <html1>), …

• Map simply passes input along…


• Shuffle & sort on URL (group ID & doc_content for URL together)
o (http://del.icio.us/, 0), (http://del.icio.us/, <html0>), (http://digg.com/, <html1>), (http://digg.com/, 1), …

• Reduce outputs result stream of (ID, doc_content) pairs


o In: (http://del.icio.us/, [0, html0]), (http://digg.com/, [html1, 1]), …
o Out: (http://del.icio.us/, 0, <html0>), (http://digg.com/, 1, <html1>), …
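
A minimal single-machine sketch of this reduce-side join, where grouping by URL stands in for shuffle & sort:

from collections import defaultdict

ids  = [("http://del.icio.us/", 0), ("http://digg.com/", 1)]
docs = [("http://del.icio.us/", "<html0>"), ("http://digg.com/", "<html1>")]

# Map passes both kinds of records along; shuffle groups them by URL
groups = defaultdict(list)
for url, value in ids + docs:
    groups[url].append(value)

# Reduce: for each URL, pair up the ID with the document content
for url, values in groups.items():
    doc_id  = next(v for v in values if isinstance(v, int))
    content = next(v for v in values if isinstance(v, str))
    print((url, doc_id, content))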
MapReduce Use Case 4: Reverse graph edge directions
(& output in node order)
• Input example: adjacency list of a graph (3 nodes and 4 edges)
o (1, [2, 3]), (3, [1, 2])
• Output: all edges reversed, e.g., (1, [3]), (2, [1, 3]), (3, [1])

• Sort node ids in the output
o even though MapReduce only sorts on keys?

• MapReduce format
o Input: (3, [1, 2]), (1, [2, 3])
o Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]) {reverse edge direction}
o Out: (1, [3]), (2, [1, 3]), (3, [1])
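
A minimal sketch of the map and reduce steps for edge reversal:

from collections import defaultdict

adjacency = [(3, [1, 2]), (1, [2, 3])]

# Map: for every edge, emit one (target, source) pair, reversing its direction
intermediate = [(dst, src) for src, dsts in adjacency for dst in dsts]

# Shuffle: group sources by target node (MapReduce sorts on these keys for us)
groups = defaultdict(list)
for dst, src in intermediate:
    groups[dst].append(src)

# Reduce: sort each adjacency list so node ids appear in order
print(sorted((node, sorted(srcs)) for node, srcs in groups.items()))
# [(1, [3]), (2, [1, 3]), (3, [1])]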


MapReduce Use Case 5: Inverted Indexing Preliminaries



Using MapReduce to Construct Indexes: A Simple Approach



Inverted Index: Data flow

Document Foo: "This page contains so much text"
Foo map output:
contains: Foo, much: Foo, page: Foo, so: Foo, text: Foo, This: Foo

Document Bar: "My page contains text too"
Bar map output:
contains: Bar, My: Bar, page: Bar, text: Bar, too: Bar

Reduced output:
contains: Foo, Bar
much: Foo
My: Bar
page: Foo, Bar
so: Foo
text: Foo, Bar
This: Foo
too: Bar
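
A minimal sketch of this data flow:

from collections import defaultdict

docs = {"Foo": "This page contains so much text",
        "Bar": "My page contains text too"}

# Map: emit a (term, docid) pair for every word in every document
pairs = [(term, docid) for docid, text in docs.items() for term in text.split()]

# Shuffle + Reduce: group the docids per term to form the inverted lists
index = defaultdict(list)
for term, docid in sorted(pairs):
    index[term].append(docid)

for term, postings in sorted(index.items()):
    print(term, ":", ", ".join(postings))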


Processing Flow Optimization
• A more detailed analysis of the processing flow
o map: (docid1, content1) → (t1, docid1), (t2, docid1), …

• Shuffle by t, preparing for map-reducer communication

• Sort by t, conducted in a reducer machine
o (t5, docid1), (t4, docid3), … → (t4, docid3), (t4, docid1), (t5, docid1), …

• Reduce: (t4, [docid3, docid1, …]) → (t, ilist)
o docid: a unique integer
o t: a term, e.g., "apple"
o ilist: a complete inverted list

• But this is a) inefficient, b) docids are sorted inside reducers, and
c) it assumes the inverted list of a word fits in memory
Using Combine() to Reduce Communication
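
The slide's details are not reproduced; the idea, sketched for word counting, is that each mapper pre-aggregates its own output locally, sending one ("word", local_count) pair per distinct word instead of one ("word", 1) pair per occurrence:

from collections import Counter

def map_with_combine(chunk_text):
    # Combine: count locally first, so far fewer pairs cross the network
    return list(Counter(chunk_text.lower().split()).items())

def reduce_fn(word, partial_counts):
    # Reduce: sum the pre-aggregated partial counts from all mappers
    return (word, sum(partial_counts))

print(sorted(map_with_combine("how now brown cow how now")))
# [('brown', 1), ('cow', 1), ('how', 2), ('now', 2)]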



Using MapReduce to Construct Indexes

[Diagram] Documents → Parser / Indexer (Map/Combine) → inverted list
fragments, partitioned by term range (A-F, G-P, Q-Z) during Shuffle/Sort →
Merger (Reduce) → complete inverted lists.


Construct Partitioned Indexes



Generate Partitioned Index

[Diagram] Documents → Parser / Indexer (Map/Combine) → inverted list
fragments, assigned to partitions during Shuffle/Sort → Merger (Reduce) →
one complete inverted list per partition.


MapReduce Use Case 6: PageRank
PageRank

 Model page reputation on the web


$PR(x) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$

 t_1, …, t_n are the parents of page x (the pages that link to x).
 PR(x) is the PageRank of page x.
 C(t) is the out-degree of t.
 d is a damping factor.


Computing PageRank Iteratively

 Start with seed PageRank values.
 Each page distributes PageRank "credit" to all pages it points to.
 Each target page adds up the "credit" from its in-bound links to compute PR_{i+1}.

 Effects at each iteration are local: the (i+1)-th iteration depends only on the i-th.
 At iteration i, the PageRank of individual nodes can be computed independently.
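
A minimal sketch of these iterations in MapReduce style; the example graph, seed values and damping factor are illustrative:

from collections import defaultdict

graph = {1: [2, 3], 2: [3], 3: [1]}   # node -> pages it links to
rank  = {n: 1.0 for n in graph}       # seed PageRank values
d = 0.85                              # damping factor

for _ in range(20):                   # iterate until (roughly) converged
    # Map: each page distributes its rank evenly over its out-links
    credit = defaultdict(float)
    for page, targets in graph.items():
        for t in targets:
            credit[t] += rank[page] / len(targets)
    # Reduce: each page adds up the credit it received, applying the formula
    rank = {n: (1 - d) + d * credit[n] for n in graph}

print(rank)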


PageRank using MapReduce

Map: distribute PageRank “credit” to link targets

Reduce: gather up PageRank “credit” from


multiple sources to compute new PageRank value

Iterate until
convergence
Source of Image: Lin 2008



PageRank Calculation: Preliminaries



PageRank: Score Distribution and Accumulation



PageRank: Database Join to associate out-links with score



MapReduce

Summary
• Programming model applicable to many different tasks
• Simple to conceptualise and understand
• Powerful engines taking care of technical details
• Widely used in practice
• Powers a lot of Big Data hype



MapReduce Programs In Google Source Tree at end of 2004

Example uses:
• distributed grep
• distributed sort
• web link-graph reversal
• term-vector per host
• web access log stats
• inverted index construction
• document clustering
• machine learning
• statistical machine translation


Questions?
