Slawomir Nowaczyk, Halmstad University BDPP, Lecture 1 Slide 1
Big Data Parallel Programming Course
Goals
Processing vast amounts of data is at the core of data mining, deep
learning and real-time autonomous decision making, all of which are in
turn at the core of modern artificial intelligence applications. Data
can reside more or less permanently in the cloud and be accessed via
distributed file systems, and/or be streamed in real time from
multiple sensors at very high rates.
Course parts: 4 lectures, 2 labs & a project; 1 lecture & 1 lab; 2 lectures & 2 labs
Teachers: Sepideh Pashami, Slawomir Nowaczyk, Urban Bilstrup,
Abdallah Alabdallah, Peyman Mashhadi, Mahmoud Rahat
Examination
To pass the course you need to pass:
• the labs (2.5 credits, Pass/Fail) – work in pairs, reports in Jupyter.
• the project (5 credits, 5-4-3/Fail) – individual work; design and
implement a program, write a report & pass an oral exam [which
includes a presentation of your project and answering questions
both about your solution and about general course material]
Questions?
Big Data Parallel Programming
Part 2: Parallel Programming
Why? – The only way to scale up!
How? – Divide and conquer!
Task parallelism
Different tasks running on
the same data in parallel
General-purpose frameworks
The course includes modern techniques, methods and tools for
distributed storage (e.g., distributed and fault-tolerant key-value
tables including replication and coordination mechanisms) as well as
for distributed processing (e.g., frameworks such as MapReduce
and Spark). It covers both static and streamed massive data. As a
principle, all kinds of algorithms and data types are supported.
Specialised frameworks
In addition, the course includes concepts, methods and tools for
parallel programming on homogeneous clusters, in particular the CUDA
library for GPUs. As a principle, this kind of processing is highly
specialised and supports only a limited set of operations.
Parallel Programming is hard
Assumption
Output of one task is the input to the next
Benefits
All the stages can run in parallel, each working on a different data item
Limitations
Throughput is limited by the highest-latency element (the slowest stage)
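Such a pipeline can be sketched with worker threads connected by queues (a minimal Python illustration; the stage functions are made up for the example). Each stage runs concurrently with the others, but overall throughput is bounded by the slowest stage:

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """One pipeline stage: apply fn to every item, pass results on."""
    while True:
        item = inbox.get()
        if item is None:       # sentinel: shut down and tell the next stage
            outbox.put(None)
            break
        outbox.put(fn(item))

# A three-stage pipeline: split line -> lowercase words -> count words
q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
threads = [
    threading.Thread(target=stage, args=(str.split, q0, q1)),
    threading.Thread(target=stage, args=(lambda ws: [w.lower() for w in ws], q1, q2)),
    threading.Thread(target=stage, args=(len, q2, q3)),
]
for t in threads:
    t.start()

for line in ["How now brown cow", "How does it work now"]:
    q0.put(line)
q0.put(None)

results = []
while (item := q3.get()) is not None:
    results.append(item)
for t in threads:
    t.join()
print(results)  # [4, 5]: words per line
```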
Programming the Pipeline
• Matrices / tensors
o a single class of problems
o common & important
o worth creating dedicated
specialised hardware
Spark
Why? – Process Big Data in a flexible way, with arbitrary code
How? – Build upon ideas from functional programming
The trick? – Forbid global state & automate worker management
CUDA
Why? – Make very repetitive computations on data run very fast
How? – Only allow simple math operations on tensors/matrices
The trick? – Use dedicated hardware, carefully designed to the task
Goals
• Introduce a little bit of Python programming
• Discuss the difference between having “small” or “big” data
• Design an algorithm for counting the number of word occurrences
i.e., how many times each word is used in a given text
An Algorithm
• Read all the text from the file, and split into words
• Count the number of occurrences of each unique word
• Write the output to a file or a screen
each unique word & the number of times it occurs in the file
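The three steps map directly onto a few lines of Python; a minimal sketch using an inline example text instead of a file:

```python
from collections import Counter

# Read all the text (here: a small inline example), split into words
text = "these are some words and these are more words"
words = text.split()

# Count the number of occurrences of each unique word
counts = Counter(words)

# Write the output: each unique word & how many times it occurs
for word in sorted(counts):
    print(word, counts[word])
```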
Goals
• Present an algorithm that can be parallelised
• A little bit more Python programming
• Discuss the abstract structure of the new algorithm
• Consider other problems that exhibit similar structure
i.e., that can be solved with the same class of algorithms
• Motivate the need for general-purpose distributed infrastructure
Divide-and-Conquer
• Divide file into parts, read text from each part, and split into words
• Count the number of occurrences of unique words in each part
• Merge all the partial results together (somehow)
• Write the output to a file or a screen
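A sketch of these steps in Python, using threads for simplicity (note: because of CPython's GIL, real speedup for CPU-bound counting would need processes, e.g. multiprocessing, but the divide/count/merge structure is the same):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

text = "these are some words and these are more words"
words = text.split()

# Divide: split the word list into (roughly equal) parts
n_parts = 3
parts = [words[i::n_parts] for i in range(n_parts)]

# Count occurrences of unique words in each part, in parallel
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(Counter, parts))

# Merge all the partial results together: Counters support "+"
total = sum(partials, Counter())
print(sorted(total.items()))
```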
Sort the resulting list of pairs in word order:
[('and', 1), ('are', 1), ('are', 1), ('more', 1), ('some', 1), ('these', 1), ('these', 1), ('words', 1), ('words', 1)]

Group all the "1"s for each, i.e., now pair each word with a list of "1"s:
[('and', [1]), ('are', [1, 1]), ('more', [1]), ('some', [1]), ('these', [1, 1]), ('words', [1, 1])]

Sum up all the numbers in the list for each word:
[('and', 1), ('are', 2), ('more', 1), ('some', 1), ('these', 2), ('words', 2)]
All the steps can be parallelised!
Map
Apply the same function to all the elements in the input list.
In the example, it was "pair with 1" – that's a function. This can be
done in parallel, over chunks of the input (of equal or different sizes)…
Then, instead of one list, we will get several lists!
Reduce
The last part is called reducing: we reduce a list of values to a single
value by applying an accumulating function (here: plus). This can also be
done in parallel: we can take parts of the list and reduce each part separately.
Instead of one output file we might get several output files!
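The pair-sort-group-sum steps above can be written directly with Python's built-in map and functools.reduce; a minimal sequential sketch (the same text as in the worked example):

```python
from functools import reduce
from itertools import groupby

words = "these are some words and these are more words".split()

# Map: apply the same function ("pair with 1") to every element
pairs = list(map(lambda w: (w, 1), words))

# Sort the pairs in word order, then group the 1s for each word
pairs.sort()
grouped = [(w, [one for _, one in grp])
           for w, grp in groupby(pairs, key=lambda p: p[0])]

# Reduce: collapse each list of 1s with an accumulating function (plus)
counts = [(w, reduce(lambda a, b: a + b, ones)) for w, ones in grouped]
print(counts)
# [('and', 1), ('are', 2), ('more', 1), ('some', 1), ('these', 2), ('words', 2)]
```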
Big Data Parallel Programming
Part 4: MapReduce Paradigm
• Complexity
o track billions of data locations and millions of instructions
• Manage with
o modular design
o high-level programming languages
• Message passing
o independent tasks encapsulating local data
o tasks interact by exchanging messages
• Shared memory
o tasks share a common address space
o tasks interact by reading and writing this space asynchronously
• Data parallelisation
o tasks execute a sequence of independent operations
o data usually evenly partitioned across tasks
o also referred to as “embarrassingly parallel”
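A minimal sketch of data parallelism (illustrative only): the same independent operation runs over evenly partitioned data, and no task needs to communicate with another:

```python
from concurrent.futures import ThreadPoolExecutor

def task(chunk):
    """Each task executes the same sequence of independent operations
    on its own partition of the data."""
    return sum(x * x for x in chunk)

data = list(range(100))
# Data evenly partitioned across tasks
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(task, chunks))

print(sum(partials))  # same result as the sequential sum of squares
```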
Worked example: word count with MapReduce.

Input splits (one per mapper):
  How now brown cow
  How does it work now

Map output:
  mapper 1: <How,1> <now,1> <brown,1> <cow,1>
  mapper 2: <How,1> <does,1> <it,1> <work,1> <now,1>

After sorting and grouping:
  <brown,1> <cow,1> <does,1> <How,1 1> <it,1> <now,1 1> <work,1>

Reduce output:
  brown 1, cow 1, does 1, How 2, it 1, now 2, work 1
MapReduce Framework
[Figure: Input → Map → sort & group → Reduce → Output]
• … the same set but with each integer appearing only once
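That deduplication task fits MapReduce naturally: map emits each integer as a key (the value is irrelevant), and since every distinct key reaches exactly one reduce call, reduce simply emits the key once. A plain-Python sketch (the map_fn/reduce_fn names are illustrative):

```python
from itertools import groupby

def map_fn(x):
    # Emit the integer itself as the key; the value carries no information
    yield (x, None)

def reduce_fn(key, values):
    # Each distinct key reaches exactly one reduce call: emit it once
    yield key

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
pairs = sorted(kv for x in data for kv in map_fn(x))
distinct = [out for k, grp in groupby(pairs, key=lambda p: p[0])
            for out in reduce_fn(k, [v for _, v in grp])]
print(distinct)  # [1, 2, 3, 4, 5, 6, 9]
```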
No Free Lunch???
The engine will automatically do the following:
• partition the input data
• schedule the program execution across a set of machines
(the mappers, the reducers, the sorters and the groupers)
• handle machine failures & restart computations
• manage inter-machine communication
Programmer’s Job
As a programmer, you only need to say what to do on each piece of
data – for the mapping part, map() – and what to do on each piece of
grouped value – for the reducing part, reduce()!
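What this division of labour looks like in code: the programmer supplies only map() and reduce(), while the engine handles partitioning, shuffling and grouping. A toy pure-Python "engine" (hypothetical, for illustration only; real frameworks distribute this across machines):

```python
from itertools import groupby

# --- the programmer writes only these two functions ---
def map_fn(line):
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

# --- a toy "engine": sorting and grouping are done for you ---
def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = sorted(kv for r in records for kv in map_fn(r))
    grouped = groupby(intermediate, key=lambda kv: kv[0])
    return [out for k, grp in grouped
                for out in reduce_fn(k, [v for _, v in grp])]

lines = ["How now brown cow", "How does it work now"]
print(run_mapreduce(lines, map_fn, reduce_fn))
# [('How', 2), ('brown', 1), ('cow', 1), ('does', 1), ('it', 1), ('now', 2), ('work', 1)]
```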
Data Split
Data is stored in “chunk servers” (distributed and redundant)
And the chunk servers can also be doing computations…
But then we need to organise computations so that they can work on
pieces of the data, for example with Map-Reduce algorithms!
• MapReduce format
o Input: (3, [1, 2]), (1, [2, 3])
o Intermediate: (1, [3]), (2, [3]), (2, [1]), (3, [1]) {reverse edge direction}
o Output: (1, [3]), (2, [1, 3]), (3, [1])
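The edge-reversal step in plain-Python map/reduce terms, reusing the grouping idea from the word-count example (a sketch, not a distributed implementation):

```python
from itertools import groupby

def map_fn(node, neighbours):
    # Reverse each edge: emit (destination, source)
    for dst in neighbours:
        yield (dst, node)

graph = [(3, [1, 2]), (1, [2, 3])]
pairs = sorted(kv for node, nbrs in graph for kv in map_fn(node, nbrs))
reversed_graph = [(k, sorted(v for _, v in grp))
                  for k, grp in groupby(pairs, key=lambda p: p[0])]
print(reversed_graph)  # [(1, [3]), (2, [1, 3]), (3, [1])]
```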
[Figure: building an inverted index with MapReduce. Multiple parsers/indexers process documents in parallel; intermediate term lists are partitioned by term range (e.g., G-P, Q-Z) and combined by mergers]
Effects at each iteration are local: the (i+1)th iteration depends only on the ith iteration.
At iteration i, the PageRank of individual nodes can be computed independently.
Iterate until convergence.
Source of Image: Lin 2008
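One PageRank iteration can be sketched as follows (a simplified illustration with uniform teleport probability and no dangling-node handling; the function and parameter names are made up for the example). Because each node's new rank depends only on the previous iteration, all nodes can be updated in parallel:

```python
def pagerank_step(ranks, links, d=0.85):
    """One iteration: new ranks depend only on the previous iteration's ranks."""
    n = len(ranks)
    new = {node: (1 - d) / n for node in ranks}
    for node, outgoing in links.items():
        share = d * ranks[node] / len(outgoing)
        for dst in outgoing:
            new[dst] += share  # each contribution is computed independently
    return new

links = {1: [2, 3], 2: [3], 3: [1]}          # node -> outgoing edges
ranks = {node: 1 / 3 for node in links}      # uniform initial ranks
for _ in range(50):                          # iterate until (near) convergence
    ranks = pagerank_step(ranks, links)
print({node: round(r, 3) for node, r in sorted(ranks.items())})
```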
Summary
• Programming model applicable to many different tasks
• Simple to conceptualise and understand
• Powerful engines taking care of technical details
• Widely used in practice
• Powers a lot of Big Data hype
• distributed grep
• term-vector per host
• web access log stats
• inverted index construction
• document clustering
• machine learning
• statistical machine translation
• …