
15-210: Parallelism in the Real World

Cray-1 (1976): the world's most expensive love seat


•  Types of parallelism
•  Parallel Thinking
•  Nested Parallelism
•  Examples (Cilk, OpenMP, Java Fork/Join)
•  Concurrency


Since 2005: Multicore computers

Data Center: Hundreds of thousands of computers



Xeon Phi: Knights Landing
72 cores, x86, 500 GB/s memory bandwidth
Summer 2015


Moore's Law

Moore's Law and Performance



64-core blade servers ($6K) (shared memory)
[Figure: Andrew Chien, 2008]

1024 "CUDA" cores



Parallel Hardware
Many forms of parallelism
–  Supercomputers: large-scale, shared memory
–  Clusters and data centers: large-scale, distributed memory
–  Multicores: tightly coupled, smaller scale
–  GPUs, on-chip vector units
–  Instruction-level parallelism
Parallelism is important in the real world.


Key Challenge: Software
(How to Write Parallel Code?)

At a high level, it is a two-step process:
–  Design a work-efficient, low-span parallel algorithm
–  Implement it on the target hardware

In reality: each system requires different code because programming systems are immature
–  Huge effort to generate efficient parallel code.
   •  Example: Quicksort in MPI is 1700 lines of code, and about the same in CUDA
–  Implement one parallel algorithm: a whole thesis.

Take 15-418 (Parallel Computer Architecture and Prog.)

15-210 Approach

Enable parallel thinking by raising the abstraction level:
I.  Parallel thinking: applicable to many machine models and programming languages
II. Reason about correctness and efficiency of algorithms and data structures.


Parallel Thinking

Recognizing true dependences: unteach sequential programming.

Parallel algorithm-design techniques
–  Operations on aggregates: map/reduce/scan
–  Divide & conquer, contraction
–  Viewing computation as a DAG (based on dependences)

Cost model based on work and span

Quicksort from Aho-Hopcroft-Ullman (1974)

procedure QUICKSORT(S):
  if S contains at most one element then return S
  else
    begin
      choose an element a randomly from S;
      let S1, S2 and S3 be the sequences of elements in S
        less than, equal to, and greater than a, respectively;
      return (QUICKSORT(S1) followed by S2 followed by QUICKSORT(S3))
    end

Quicksort from Sedgewick (2003)

public void quickSort(int[] a, int left, int right) {
  int i = left-1; int j = right;
  if (right <= left) return;
  while (true) {
    while (a[++i] < a[right]);
    while (a[right] < a[--j])
      if (j == left) break;
    if (i >= j) break;
    swap(a, i, j); }
  swap(a, i, right);
  quickSort(a, left, i - 1);
  quickSort(a, i + 1, right); }

Styles of Parallel Programming

Data parallelism/Bulk Synchronous/SPMD
Nested parallelism: what we covered
Message passing
Futures (other pipelined parallelism)
General Concurrency



Nested Parallelism

Nested parallelism = arbitrary nesting of parallel loops + fork-join
–  Assumes no synchronization among parallel tasks except at join points.
–  Deterministic if no race conditions

Advantages:
–  Good schedulers are known
–  Easy to understand, debug, and analyze costs
–  Purely functional or imperative… either works

Serial-Parallel DAGs

Dependence graphs of nested parallel computations are series-parallel.

Two tasks are parallel if not reachable from each other.
A data race occurs if two parallel tasks access the same location and at least one access is a write.
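As a concrete illustration of the definition above (my own example, not from the slides), two parallel tasks that both write the same variable form a data race, and the outcome is nondeterministic:

#include <cstdio>
#include <thread>

int shared = 0;

int main() {
  // Two parallel tasks; both write `shared`, and at least one access
  // is a write, so this is a data race (undefined behavior in C++).
  std::thread t1([] { shared = 1; });
  std::thread t2([] { shared = 2; });
  t1.join();
  t2.join();
  printf("shared = %d\n", shared);   // may print 1 or 2
  return 0;
}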


Nested Parallelism: parallel loops

Cilk:
  cilk_for (i=0; i < n; i++)
    B[i] = A[i]+1;

Microsoft TPL (C#, F#):
  Parallel.ForEach(A, x => x+1);

Nesl, Parallel Haskell:
  B = {x + 1 : x in A}

OpenMP:
  #pragma omp for
  for (i=0; i < n; i++)
    B[i] = A[i] + 1;

Nested Parallelism: fork-join

cobegin { S1; S2; }          Dates back to the 60s. Used in dialects of Algol, Pascal

coinvoke(f1,f2)              Java fork-join framework
Parallel.invoke(f1,f2)       Microsoft TPL (C#, F#)

#pragma omp sections         OpenMP (C++, C, Fortran, …)
{
  #pragma omp section
    S1;
  #pragma omp section
    S2;
}


Nested Parallelism: fork-join

cilk, cilk+:
  spawn S1;
  S2;
  sync;

Various functional languages:
  (exp1 || exp2)

Various dialects of ML and Lisp:
  plet
    x = exp1
    y = exp2
  in
    exp3

Cilk vs. what we've covered

Fork-Join:
  ML:          val (a,b) = par(fn () => f(x), fn () => g(y))
  Pseudocode:  val (a,b) = (f(x) || g(y))
  Cilk:        cilk_spawn a = f(x);
               b = g(y);
               cilk_sync;

Map:
  ML:          S = map f A
  Pseudocode:  S = <f x : x in A>
  Cilk:        cilk_for (int i = 0; i < n; i++)
                 S[i] = f(A[i])

Cilk vs. what we've covered

Tabulate:
  ML:          S = tabulate f n
  Pseudocode:  S = <f i : i in <0,..n-1>>
  Cilk:        cilk_for (int i = 0; i < n; i++)
                 S[i] = f(i)

Example Cilk

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = cilk_spawn fib(n-1);
    y = cilk_spawn fib(n-2);
    cilk_sync;
    return (x+y);
  }
}
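For readers without a Cilk compiler, the same fork-join pattern can be sketched in plain C++, with std::async standing in for cilk_spawn and .get() for cilk_sync (a rough translation of mine, not from the slides; a real runtime would not create one thread per spawn):

#include <cstdio>
#include <future>

long fib(long n) {
  if (n < 2) return n;
  if (n < 20)                          // sequential cutoff so we don't spawn
    return fib(n - 1) + fib(n - 2);    // a thread for every tiny subproblem
  auto x = std::async(std::launch::async, fib, n - 1);  // "spawn" fib(n-1)
  long y = fib(n - 2);                 // run fib(n-2) in the current task
  return x.get() + y;                  // "sync": wait for the spawned task
}

int main() {
  printf("fib(30) = %ld\n", fib(30));
  return 0;
}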



Example: Numerical Integration

Mathematically, we know that:

  ∫₀¹ 4.0/(1+x²) dx = π

We can approximate the integral as a sum of rectangles:

  ∑_{i=0}^{N} F(x_i) Δx ≈ π

where each rectangle has width Δx and height F(x_i) at the middle of interval i.

[Figure: plot of F(x) = 4.0/(1+x²) on the interval [0,1]]

OpenMP: The C code for Approximating PI

The C/OpenMP code for Approx. PI
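The code on these two slides appears only as an image in the original deck. A minimal sketch of the standard OpenMP version (variable names and step count are my own, not taken from the slide):

#include <cstdio>

static const long num_steps = 100000000;

int main() {
  const double step = 1.0 / (double)num_steps;
  double sum = 0.0;

  // Each thread accumulates part of the sum; reduction(+:sum) combines
  // the per-thread partial sums when the loop finishes.
  #pragma omp parallel for reduction(+:sum)
  for (long i = 0; i < num_steps; i++) {
    double x = (i + 0.5) * step;     // midpoint of interval i
    sum += 4.0 / (1.0 + x * x);      // F(x) = 4/(1+x^2)
  }

  printf("pi ~= %.12f\n", step * sum);
  return 0;
}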


Example: Java Fork/Join

class Fib extends FJTask {
  volatile int result; // serves as arg and result
  int n;
  Fib(int _n) { n = _n; }

  public void run() {
    if (n <= 1) result = n;
    else if (n <= sequentialThreshold) result = seqFib(n);
    else {
      Fib f1 = new Fib(n - 1);
      Fib f2 = new Fib(n - 2);
      coInvoke(f1, f2);
      result = f1.result + f2.result;
    }
  }
}


Cost Model (General)

Work: total number of operations
–  costs are added across parallel calls

Span: depth/critical path of the computation
–  maximum span is taken across forked calls

Parallelism = Work/Span
–  Approximately the # of processors that can be effectively used.

Combining Costs (Nested Parallelism)

Compositional. Combining for a parallel for:

  pfor (i=0; i<n; i++)
    f(i);

  W(pfor ...) = ∑_{i=0}^{n-1} W(f(i))       (work)
  D(pfor ...) = max_{i=0}^{n-1} D(f(i))     (span)
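As a quick illustration of these rules (the numbers are mine, not from the slides): if each call f(i) itself has work O(log n) and span O(log n), then

\[
  W(\texttt{pfor}\ \dots) \;=\; \sum_{i=0}^{n-1} O(\log n) \;=\; O(n \log n),
  \qquad
  D(\texttt{pfor}\ \dots) \;=\; \max_{0 \le i < n} O(\log n) \;=\; O(\log n).
\]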

Why Work and Span

Simple measures that give us a good sense of efficiency (work) and scalability (span).

Can schedule in O(W/P + D) time on P processors.
This is within a constant factor of optimal.

Goals in designing an algorithm:
1.  Work should be about the same as the sequential running time. When it matches
    asymptotically we say it is work efficient.
2.  Parallelism (W/D) should be polynomial. O(n^{1/2}) is probably good enough.

How do the problems do on a modern multicore?

[Figure: bar chart of 32-core speedups (T1/T32 and Tseq/T32) for a suite of benchmark
problems (e.g. sorting, duplicate removal, spanning trees, breadth-first search, nearest
neighbors, N-body, Delaunay triangulation, sparse matrix-vector); Tseq/T32 values range
roughly from 9 to 32.]


Styles of Parallel Programming

Data parallelism/Bulk Synchronous/SPMD
Nested parallelism: what we covered
Message passing
Futures (other pipelined parallelism)
General Concurrency

Parallelism vs. Concurrency

Parallelism: using multiple processors/cores running at the same time.
  Property of the machine.
Concurrency: non-determinacy due to interleaving threads.
  Property of the application.

                                Concurrency
                         sequential          concurrent
Parallelism  serial      Traditional         Traditional OS
                         programming
             parallel    Deterministic       General
                         parallelism         parallelism


Concurrency: Stack Example 1

struct link {int v; link* next;}

struct stack {
  link* headPtr;
  void push(link* a) {
    a->next = headPtr;
    headPtr = a; }
  link* pop() {
    link* h = headPtr;
    if (headPtr != NULL)
      headPtr = headPtr->next;
    return h;}
}

[The slides step through interleavings of concurrent push/pop operations on this
unsynchronized stack (head H, new nodes A and B), showing how an update can be lost.]
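One such interleaving (reconstructed here, not copied from the slides) loses a pushed node:

// Initially: headPtr -> H
// Thread 1: push(A)                    Thread 2: push(B)
// A->next = headPtr;   (A -> H)
//                                      B->next = headPtr;   (B -> H)
//                                      headPtr = B;
// headPtr = A;
// Final state: headPtr -> A -> H; node B has been lost.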





CAS

bool CAS(ptr* addr, ptr a, ptr b) {
  atomic {
    if (*addr == a) {
      *addr = b;
      return 1;
    } else
      return 0;
  }
}

A built-in instruction on most processors:
  CMPXCHG8B  – 8-byte version for x86
  CMPXCHG16B – 16-byte version

Concurrency: Stack Example 2

struct stack {
  link* headPtr;
  void push(link* a) {
    link* h;
    do {
      h = headPtr;
      a->next = h;
    } while (!CAS(&headPtr, h, a)); }
  link* pop() {
    link* h; link* nxt;
    do {
      h = headPtr;
      if (h == NULL) return NULL;
      nxt = h->next;
    } while (!CAS(&headPtr, h, nxt));
    return h;}
}
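The CAS above is written as pseudocode; in C++11 the same idea maps onto std::atomic and compare_exchange_weak roughly as follows (a sketch of mine, not from the slides; like the slide's version it ignores the ABA and memory-reclamation issues discussed next):

#include <atomic>

struct Link { int v; Link* next; };

struct Stack {
  std::atomic<Link*> headPtr{nullptr};

  void push(Link* a) {
    Link* h = headPtr.load();
    do {
      a->next = h;
      // On failure, compare_exchange_weak reloads h with the current
      // head and the loop retries with the fresh value.
    } while (!headPtr.compare_exchange_weak(h, a));
  }

  Link* pop() {
    Link* h = headPtr.load();
    while (h != nullptr && !headPtr.compare_exchange_weak(h, h->next)) {
      // h was refreshed by the failed CAS; retry.
    }
    return h;   // nullptr if the stack was empty
  }
};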


[The original slides repeat the CAS-based stack code several times, with pictures
stepping through concurrent push/pop interleavings on it.]


Concurrency: Stack Example 2'

P1: x = s.pop(); y = s.pop(); s.push(x);
P2: z = s.pop();

Before: A B C
After:  B C

The ABA problem:
  P2: h = headPtr;
  P2: nxt = h->next;
  P1: everything
  P2: CAS(&headPtr, h, nxt)

Can be fixed with a counter and 2CAS, but…

Concurrency: Stack Example 3

struct link {int v; link* next;}

struct stack {
  link* headPtr;
  void push(link* a) {
    atomic {
      a->next = headPtr;
      headPtr = a; }}
  link* pop() {
    atomic {
      link* h = headPtr;
      if (headPtr != NULL)
        headPtr = headPtr->next;
      return h;}}
}
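The "counter and 2CAS" fix mentioned above can be sketched in C++ by pairing the head pointer with a version counter and CAS-ing the pair (my own sketch, not from the slides; it is lock-free only where a 16-byte CAS such as CMPXCHG16B is available, and it still sidesteps the memory-reclamation problem hinted at by the "but…"):

#include <atomic>
#include <cstdint>

struct Link { int v; Link* next; };

// Head pointer paired with a version counter: the counter changes on every
// successful update, so a stale (pointer, count) pair fails its CAS even if
// the same pointer value happens to reappear (the ABA case above).
struct Head { Link* ptr; std::uint64_t count; };

struct Stack {
  std::atomic<Head> head{Head{nullptr, 0}};

  void push(Link* a) {
    Head h = head.load();
    Head n;
    do {
      a->next = h.ptr;
      n = Head{a, h.count + 1};
    } while (!head.compare_exchange_weak(h, n));
  }

  Link* pop() {
    Head h = head.load();
    Head n;
    do {
      if (h.ptr == nullptr) return nullptr;
      n = Head{h.ptr->next, h.count + 1};  // unsafe if h.ptr was already freed:
                                           // reclamation is a separate problem
    } while (!head.compare_exchange_weak(h, n));
    return h.ptr;
  }
};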

Concurrency: Stack Example 3'

void swapTop(stack s) {
  link* x = s.pop();
  link* y = s.pop();
  s.push(x);
  s.push(y);
}

Queues are trickier than stacks.

Styles of Parallel Programming

Data parallelism/Bulk Synchronous/SPMD
Nested parallelism: what we covered
Message passing
Futures (other pipelined parallelism)
General Concurrency



Futures: Example

fun quickSort S = let
  fun qs([], rest) = rest
    | qs(a::T, rest) =
        let
          val L1 = filter (fn b => b < a) T
          val L2 = filter (fn b => b >= a) T
        in
          qs(L1, a::(qs(L2, rest)))
        end
  fun filter f [] = []
    | filter f (h::T) =
        if f(h) then h::(filter f T)
        else filter f T
in
  qs(S, [])
end

Futures: Example (with futures added)

fun quickSort S = let
  fun qs([], rest) = rest
    | qs(a::T, rest) =
        let
          val L1 = filter (fn b => b < a) T
          val L2 = filter (fn b => b >= a) T
        in
          qs(L1, future(a::(qs(L2, rest))))
        end
  fun filter f [] = []
    | filter f (h::T) =
        if f(h) then future(h::(filter f T))
        else filter f T
in
  qs(S, [])
end

Quicksort: Nested Parallel

Parallel partition and append
  Work = O(n log n)
  Span = O(lg² n)

Quicksort: Futures

  Work = O(n log n)
  Span = O(n)


Styles of Parallel Programming

  Data parallelism/Bulk Synchronous/SPMD
* Nested parallelism: what we covered
  Message passing
* Futures (other pipelined parallelism)
* General Concurrency

Xeon 7500

Memory: up to 1 TB
  4 sockets, each with a 24 MB L3 cache
  8 cores per socket, each with a 128 KB L2 and a 32 KB L1 cache

Question

How do we get nested parallel programs to work well on a fixed set of processors?
Programs say nothing about processors.

Answer: good schedulers

Greedy Schedules

"Speedup versus Efficiency in Parallel Systems",
Eager, Zahorjan and Lazowska, 1989

For any greedy schedule:

  Efficiency    = W / T_P  ≥  P·W / (W + D·(P − 1))
  Parallel Time = T_P  ≤  W/P + D
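As a numeric illustration of the greedy bound (numbers chosen here, not from the slides): with W = 10^8, D = 10^6, and P = 100,

\[
  T_P \;\le\; \frac{W}{P} + D \;=\; 10^6 + 10^6 \;=\; 2\times 10^6,
  \qquad\text{so}\qquad
  \frac{W}{T_P} \;\ge\; \frac{10^8}{2\times 10^6} \;=\; 50,
\]

i.e. the schedule is guaranteed a speedup of at least 50 (at least 50% efficiency); once D approaches W/P, the span rather than the processor count limits the guaranteed speedup.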



Types of Schedulers

Breadth-first schedulers
Work-stealing schedulers
Depth-first schedulers

Breadth First Schedules

Most naïve schedule. Used by most implementations of P-threads.

O(n³) tasks

Bad space usage, bad locality



Work Stealing

[Figure: processors P1 P2 P3 P4, each with a local work queue that has an "old" end and a "new" end]

–  push new jobs on the "new" end
–  pop jobs from the "new" end
–  if a processor runs out of work, then "steal" from another's "old" end

Work Stealing

Tends to schedule "sequential blocks" of tasks
[Figure: the same queues, with tasks grouped into sequential blocks; marked entries indicate steals]

Each processor tends to execute a sequential part of the computation.
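A minimal sketch of the per-processor queue discipline described above (my own, mutex-based for clarity; real work-stealing schedulers such as Cilk's use carefully engineered lock-free deques):

#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// Per-processor work queue: the owner pushes and pops at the "new" end,
// while thieves steal from the "old" end.
class WorkQueue {
  std::deque<std::function<void()>> q;   // front = "old" end, back = "new" end
  std::mutex m;
public:
  void push(std::function<void()> task) {         // owner: push on "new" end
    std::lock_guard<std::mutex> g(m);
    q.push_back(std::move(task));
  }
  std::optional<std::function<void()>> pop() {    // owner: pop from "new" end
    std::lock_guard<std::mutex> g(m);
    if (q.empty()) return std::nullopt;
    auto t = std::move(q.back());
    q.pop_back();
    return t;
  }
  std::optional<std::function<void()>> steal() {  // thief: take from "old" end
    std::lock_guard<std::mutex> g(m);
    if (q.empty()) return std::nullopt;
    auto t = std::move(q.front());
    q.pop_front();
    return t;
  }
};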


Work Stealing Theory

For strict computations (Blumofe and Leiserson, 1999):
  # of steals = O(PD)
  Space = O(P·S1), where S1 is the sequential space

Acar, Blelloch and Blumofe, 2003:
  # of cache misses on distributed caches ≤ M1 + O(CPD)
  where M1 = sequential misses, C = cache size

Work Stealing Practice

Used in the Cilk scheduler
–  Small overheads because the common case of pushing/popping from the local queue can
   be made fast (with good data structures and compiler help).
–  No contention on a global queue
–  Has good distributed-cache behavior
–  Can indeed require O(S1·P) memory

Used in the X10 scheduler, and others


Parallel Depth First Schedules (P-DFS)

List scheduling based on depth-first ordering.

[Figure: a DAG on nodes 1–10 and its 2-processor schedule:
  1 / 2,6 / 3,4 / 5,7 / 8 / 9 / 10]

For strict computations a shared stack implements a P-DFS.

"Premature task" in P-DFS

A running task is premature if there is an earlier sequential task that is not complete.

[Figure: the same DAG and 2-processor schedule with the premature nodes marked]


P-DFS Theory

Blelloch, Gibbons, Matias, 1999. For any computation:
  Premature nodes at any time = O(PD)
  Space = S1 + O(PD)

Blelloch and Gibbons, 2004:
  With a shared cache of size C1 + O(PD) we have Mp = M1

P-DFS Practice

Experimentally uses less memory than work stealing and performs better on a shared cache.
Requires some "coarsening" to reduce overheads.


Conclusions
-  lots of parallel languages
-  lots of parallel programming styles
-  high-level ideas cross between languages and styles
-  scheduling is important


