
15-210: Parallelism in the Real World

Cray-1 (1976): the world's most expensive love seat


•  Types of parallelism
•  Parallel Thinking
•  Nested Parallelism
•  Examples (Cilk, OpenMP, Java Fork/Join)
•  Concurrency


Since 2005: Multicore computers

Data Center: Hundreds of thousands of computers



Xeon Phi: Knights Landing
72 cores, x86, 500 GB/s memory bandwidth
Summer 2015


Moore's Law

Moore's Law and Performance



64-core blade servers ($6K) (shared memory)
[Figure: Andrew Chien, 2008]

1024 "CUDA" cores



Parallel Hardware
Many forms of parallelism
–  Supercomputers: large-scale, shared memory
–  Clusters and data centers: large-scale, distributed memory
–  Multicores: tightly coupled, smaller scale
–  GPUs, on-chip vector units
–  Instruction-level parallelism
Parallelism is important in the real world.


Key Challenge: Software
(How to Write Parallel Code?)

At a high level, it is a two-step process:
–  Design a work-efficient, low-span parallel algorithm
–  Implement it on the target hardware

In reality: each system requires different code because programming systems are immature
–  Huge effort to generate efficient parallel code.
   •  Example: Quicksort in MPI is 1700 lines of code, and about the same in CUDA
–  Implement one parallel algorithm: a whole thesis.

Take 15-418 (Parallel Computer Architecture and Prog.)

15-210 Approach

Enable parallel thinking by raising the abstraction level:
I.  Parallel thinking: applicable to many machine models and programming languages
II. Reason about correctness and efficiency of algorithms and data structures.


Parallel Thinking

Recognizing true dependences: unteach sequential programming.

Parallel algorithm-design techniques
–  Operations on aggregates: map/reduce/scan
–  Divide & conquer, contraction
–  Viewing computation as a DAG (based on dependences)

Cost model based on work and span

Quicksort from Aho-Hopcroft-Ullman (1974)

procedure QUICKSORT(S):
  if S contains at most one element then return S
  else
    begin
      choose an element a randomly from S;
      let S1, S2 and S3 be the sequences of elements in S
        less than, equal to, and greater than a, respectively;
      return (QUICKSORT(S1) followed by S2 followed by QUICKSORT(S3))
    end

Quicksort from Sedgewick (2003)

public void quickSort(int[] a, int left, int right) {
  int i = left-1; int j = right;
  if (right <= left) return;
  while (true) {
    while (a[++i] < a[right]);
    while (a[right] < a[--j])
      if (j == left) break;
    if (i >= j) break;
    swap(a, i, j); }
  swap(a, i, right);
  quickSort(a, left, i - 1);
  quickSort(a, i + 1, right); }

Styles of Parallel Programming

Data parallelism/Bulk Synchronous/SPMD
Nested parallelism: what we covered
Message passing
Futures (other pipelined parallelism)
General Concurrency



Nested Parallelism

Nested parallelism = arbitrary nesting of parallel loops + fork-join
–  Assumes no synchronization among parallel tasks except at join points.
–  Deterministic if no race conditions

Advantages:
–  Good schedulers are known
–  Easy to understand, debug, and analyze costs
–  Purely functional or imperative… either works

Serial-Parallel DAGs

Dependence graphs of nested parallel computations are series-parallel.

Two tasks are parallel if not reachable from each other.
A data race occurs if two parallel tasks access the same location and at least one access is a write.
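As a concrete illustration of the definition above (my own example, not from the slides), two parallel tasks that both write the same variable form a data race, and the outcome is nondeterministic:

#include <cstdio>
#include <thread>

int shared = 0;

int main() {
  // Two parallel tasks; both write `shared`, and at least one access
  // is a write, so this is a data race (undefined behavior in C++).
  std::thread t1([] { shared = 1; });
  std::thread t2([] { shared = 2; });
  t1.join();
  t2.join();
  printf("shared = %d\n", shared);   // may print 1 or 2
  return 0;
}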


Nested Parallelism: parallel loops

Cilk:
  cilk_for (i=0; i < n; i++)
    B[i] = A[i]+1;

Microsoft TPL (C#, F#):
  Parallel.ForEach(A, x => x+1);

Nesl, Parallel Haskell:
  B = {x + 1 : x in A}

OpenMP:
  #pragma omp for
  for (i=0; i < n; i++)
    B[i] = A[i] + 1;

Nested Parallelism: fork-join

cobegin { S1; S2; }          Dates back to the 60s. Used in dialects of Algol, Pascal

coinvoke(f1,f2)              Java fork-join framework
Parallel.invoke(f1,f2)       Microsoft TPL (C#, F#)

#pragma omp sections         OpenMP (C++, C, Fortran, …)
{
  #pragma omp section
    S1;
  #pragma omp section
    S2;
}


Nested Parallelism: fork-join

cilk, cilk+:
  spawn S1;
  S2;
  sync;

Various functional languages:
  (exp1 || exp2)

Various dialects of ML and Lisp:
  plet
    x = exp1
    y = exp2
  in
    exp3

Cilk vs. what we've covered

Fork-Join:
  ML:          val (a,b) = par(fn () => f(x), fn () => g(y))
  Pseudocode:  val (a,b) = (f(x) || g(y))
  Cilk:        cilk_spawn a = f(x);
               b = g(y);
               cilk_sync;

Map:
  ML:          S = map f A
  Pseudocode:  S = <f x : x in A>
  Cilk:        cilk_for (int i = 0; i < n; i++)
                 S[i] = f(A[i])

Cilk vs. what we've covered

Tabulate:
  ML:          S = tabulate f n
  Pseudocode:  S = <f i : i in <0,..n-1>>
  Cilk:        cilk_for (int i = 0; i < n; i++)
                 S[i] = f(i)

Example Cilk

int fib (int n) {
  if (n<2) return (n);
  else {
    int x,y;
    x = cilk_spawn fib(n-1);
    y = cilk_spawn fib(n-2);
    cilk_sync;
    return (x+y);
  }
}
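For readers without a Cilk compiler, the same fork-join pattern can be sketched in plain C++, with std::async standing in for cilk_spawn and .get() for cilk_sync (a rough translation of mine, not from the slides; a real runtime would not create one thread per spawn):

#include <cstdio>
#include <future>

long fib(long n) {
  if (n < 2) return n;
  if (n < 20)                          // sequential cutoff so we don't spawn
    return fib(n - 1) + fib(n - 2);    // a thread for every tiny subproblem
  auto x = std::async(std::launch::async, fib, n - 1);  // "spawn" fib(n-1)
  long y = fib(n - 2);                 // run fib(n-2) in the current task
  return x.get() + y;                  // "sync": wait for the spawned task
}

int main() {
  printf("fib(30) = %ld\n", fib(30));
  return 0;
}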



Example: Numerical Integration

Mathematically, we know that:

  ∫₀¹ 4.0/(1+x²) dx = π

We can approximate the integral as a sum of rectangles:

  ∑_{i=0}^{N} F(x_i) Δx ≈ π

where each rectangle has width Δx and height F(x_i) at the middle of interval i.

[Figure: plot of F(x) = 4.0/(1+x²) on the interval [0,1]]

OpenMP: The C code for Approximating PI

The C/OpenMP code for Approx. PI
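The code on these two slides appears only as an image in the original deck. A minimal sketch of the standard OpenMP version (variable names and step count are my own, not taken from the slide):

#include <cstdio>

static const long num_steps = 100000000;

int main() {
  const double step = 1.0 / (double)num_steps;
  double sum = 0.0;

  // Each thread accumulates part of the sum; reduction(+:sum) combines
  // the per-thread partial sums when the loop finishes.
  #pragma omp parallel for reduction(+:sum)
  for (long i = 0; i < num_steps; i++) {
    double x = (i + 0.5) * step;     // midpoint of interval i
    sum += 4.0 / (1.0 + x * x);      // F(x) = 4/(1+x^2)
  }

  printf("pi ~= %.12f\n", step * sum);
  return 0;
}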


Example: Java Fork/Join

class Fib extends FJTask {
  volatile int result; // serves as arg and result
  int n;
  Fib(int _n) { n = _n; }

  public void run() {
    if (n <= 1) result = n;
    else if (n <= sequentialThreshold) result = seqFib(n);
    else {
      Fib f1 = new Fib(n - 1);
      Fib f2 = new Fib(n - 2);
      coInvoke(f1, f2);
      result = f1.result + f2.result;
    }
  }
}


Cost Model (General)

Work: total number of operations
–  costs are added across parallel calls

Span: depth/critical path of the computation
–  maximum span is taken across forked calls

Parallelism = Work/Span
–  Approximately the # of processors that can be effectively used.

Combining Costs (Nested Parallelism)

Compositional. Combining for a parallel for:

  pfor (i=0; i<n; i++)
    f(i);

  W(pfor ...) = ∑_{i=0}^{n-1} W(f(i))       (work)
  D(pfor ...) = max_{i=0}^{n-1} D(f(i))     (span)
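As a quick illustration of these rules (the numbers are mine, not from the slides): if each call f(i) itself has work O(log n) and span O(log n), then

\[
  W(\texttt{pfor}\ \dots) \;=\; \sum_{i=0}^{n-1} O(\log n) \;=\; O(n \log n),
  \qquad
  D(\texttt{pfor}\ \dots) \;=\; \max_{0 \le i < n} O(\log n) \;=\; O(\log n).
\]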

Why Work and Span

Simple measures that give us a good sense of efficiency (work) and scalability (span).

Can schedule in O(W/P + D) time on P processors.
This is within a constant factor of optimal.

Goals in designing an algorithm:
1.  Work should be about the same as the sequential running time. When it matches
    asymptotically we say it is work efficient.
2.  Parallelism (W/D) should be polynomial. O(n^{1/2}) is probably good enough.

How do the problems do on a modern multicore?

[Figure: bar chart of 32-core speedups (T1/T32 and Tseq/T32) for a suite of benchmark
problems (e.g. sorting, duplicate removal, spanning trees, breadth-first search, nearest
neighbors, N-body, Delaunay triangulation, sparse matrix-vector); Tseq/T32 values range
roughly from 9 to 32.]


Styles of Parallel Programming

Data parallelism/Bulk Synchronous/SPMD
Nested parallelism: what we covered
Message passing
Futures (other pipelined parallelism)
General Concurrency

Parallelism vs. Concurrency

Parallelism: using multiple processors/cores running at the same time.
  Property of the machine.
Concurrency: non-determinacy due to interleaving threads.
  Property of the application.

                                Concurrency
                         sequential          concurrent
Parallelism  serial      Traditional         Traditional OS
                         programming
             parallel    Deterministic       General
                         parallelism         parallelism


Concurrency: Stack Example 1

struct link {int v; link* next;}

struct stack {
  link* headPtr;
  void push(link* a) {
    a->next = headPtr;
    headPtr = a; }
  link* pop() {
    link* h = headPtr;
    if (headPtr != NULL)
      headPtr = headPtr->next;
    return h;}
}

[The slides step through interleavings of concurrent push/pop operations on this
unsynchronized stack (head H, new nodes A and B), showing how an update can be lost.]
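One such interleaving (reconstructed here, not copied from the slides) loses a pushed node:

// Initially: headPtr -> H
// Thread 1: push(A)                    Thread 2: push(B)
// A->next = headPtr;   (A -> H)
//                                      B->next = headPtr;   (B -> H)
//                                      headPtr = B;
// headPtr = A;
// Final state: headPtr -> A -> H; node B has been lost.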





CAS

bool CAS(ptr* addr, ptr a, ptr b) {
  atomic {
    if (*addr == a) {
      *addr = b;
      return 1;
    } else
      return 0;
  }
}

A built-in instruction on most processors:
  CMPXCHG8B  – 8-byte version for x86
  CMPXCHG16B – 16-byte version

Concurrency: Stack Example 2

struct stack {
  link* headPtr;
  void push(link* a) {
    link* h;
    do {
      h = headPtr;
      a->next = h;
    } while (!CAS(&headPtr, h, a)); }
  link* pop() {
    link* h; link* nxt;
    do {
      h = headPtr;
      if (h == NULL) return NULL;
      nxt = h->next;
    } while (!CAS(&headPtr, h, nxt));
    return h;}
}
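The CAS above is written as pseudocode; in C++11 the same idea maps onto std::atomic and compare_exchange_weak roughly as follows (a sketch of mine, not from the slides; like the slide's version it ignores the ABA and memory-reclamation issues discussed next):

#include <atomic>

struct Link { int v; Link* next; };

struct Stack {
  std::atomic<Link*> headPtr{nullptr};

  void push(Link* a) {
    Link* h = headPtr.load();
    do {
      a->next = h;
      // On failure, compare_exchange_weak reloads h with the current
      // head and the loop retries with the fresh value.
    } while (!headPtr.compare_exchange_weak(h, a));
  }

  Link* pop() {
    Link* h = headPtr.load();
    while (h != nullptr && !headPtr.compare_exchange_weak(h, h->next)) {
      // h was refreshed by the failed CAS; retry.
    }
    return h;   // nullptr if the stack was empty
  }
};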


[The original slides repeat the CAS-based stack code several times, with pictures
stepping through concurrent push/pop interleavings on it.]


Concurrency: Stack Example 2'

P1: x = s.pop(); y = s.pop(); s.push(x);
P2: z = s.pop();

Before: A B C
After:  B C

The ABA problem:
  P2: h = headPtr;
  P2: nxt = h->next;
  P1: everything
  P2: CAS(&headPtr, h, nxt)

Can be fixed with a counter and 2CAS, but…

Concurrency: Stack Example 3

struct link {int v; link* next;}

struct stack {
  link* headPtr;
  void push(link* a) {
    atomic {
      a->next = headPtr;
      headPtr = a; }}
  link* pop() {
    atomic {
      link* h = headPtr;
      if (headPtr != NULL)
        headPtr = headPtr->next;
      return h;}}
}
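The "counter and 2CAS" fix mentioned above can be sketched in C++ by pairing the head pointer with a version counter and CAS-ing the pair (my own sketch, not from the slides; it is lock-free only where a 16-byte CAS such as CMPXCHG16B is available, and it still sidesteps the memory-reclamation problem hinted at by the "but…"):

#include <atomic>
#include <cstdint>

struct Link { int v; Link* next; };

// Head pointer paired with a version counter: the counter changes on every
// successful update, so a stale (pointer, count) pair fails its CAS even if
// the same pointer value happens to reappear (the ABA case above).
struct Head { Link* ptr; std::uint64_t count; };

struct Stack {
  std::atomic<Head> head{Head{nullptr, 0}};

  void push(Link* a) {
    Head h = head.load();
    Head n;
    do {
      a->next = h.ptr;
      n = Head{a, h.count + 1};
    } while (!head.compare_exchange_weak(h, n));
  }

  Link* pop() {
    Head h = head.load();
    Head n;
    do {
      if (h.ptr == nullptr) return nullptr;
      n = Head{h.ptr->next, h.count + 1};  // unsafe if h.ptr was already freed:
                                           // reclamation is a separate problem
    } while (!head.compare_exchange_weak(h, n));
    return h.ptr;
  }
};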

Concurrency: Stack Example 3'

void swapTop(stack s) {
  link* x = s.pop();
  link* y = s.pop();
  s.push(x);
  s.push(y);
}

Queues are trickier than stacks.

Styles of Parallel Programming

Data parallelism/Bulk Synchronous/SPMD
Nested parallelism: what we covered
Message passing
Futures (other pipelined parallelism)
General Concurrency



Futures: Example

fun quickSort S = let
  fun qs([], rest) = rest
    | qs(a::T, rest) =
        let
          val L1 = filter (fn b => b < a) T
          val L2 = filter (fn b => b >= a) T
        in
          qs(L1, a::(qs(L2, rest)))
        end
  fun filter f [] = []
    | filter f (h::T) =
        if f(h) then h::(filter f T)
        else filter f T
in
  qs(S, [])
end

Futures: Example (with futures added)

fun quickSort S = let
  fun qs([], rest) = rest
    | qs(a::T, rest) =
        let
          val L1 = filter (fn b => b < a) T
          val L2 = filter (fn b => b >= a) T
        in
          qs(L1, future(a::(qs(L2, rest))))
        end
  fun filter f [] = []
    | filter f (h::T) =
        if f(h) then future(h::(filter f T))
        else filter f T
in
  qs(S, [])
end

Quicksort: Nested Parallel

Parallel partition and append
  Work = O(n log n)
  Span = O(lg² n)

Quicksort: Futures

  Work = O(n log n)
  Span = O(n)


Styles of Parallel Programming

  Data parallelism/Bulk Synchronous/SPMD
* Nested parallelism: what we covered
  Message passing
* Futures (other pipelined parallelism)
* General Concurrency

Xeon 7500

Memory: up to 1 TB
  4 sockets, each with a 24 MB L3 cache
  8 cores per socket, each with a 128 KB L2 and a 32 KB L1 cache

Question

How do we get nested parallel programs to work well on a fixed set of processors?
Programs say nothing about processors.

Answer: good schedulers

Greedy Schedules

"Speedup versus Efficiency in Parallel Systems",
Eager, Zahorjan and Lazowska, 1989

For any greedy schedule:

  Efficiency    = W / T_P  ≥  P·W / (W + D·(P − 1))
  Parallel Time = T_P  ≤  W/P + D
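As a numeric illustration of the greedy bound (numbers chosen here, not from the slides): with W = 10^8, D = 10^6, and P = 100,

\[
  T_P \;\le\; \frac{W}{P} + D \;=\; 10^6 + 10^6 \;=\; 2\times 10^6,
  \qquad\text{so}\qquad
  \frac{W}{T_P} \;\ge\; \frac{10^8}{2\times 10^6} \;=\; 50,
\]

i.e. the schedule is guaranteed a speedup of at least 50 (at least 50% efficiency); once D approaches W/P, the span rather than the processor count limits the guaranteed speedup.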



Types of Schedulers

Breadth-first schedulers
Work-stealing schedulers
Depth-first schedulers

Breadth First Schedules

Most naïve schedule. Used by most implementations of P-threads.

O(n³) tasks

Bad space usage, bad locality



Work Stealing

[Figure: processors P1 P2 P3 P4, each with a local work queue that has an "old" end and a "new" end]

–  push new jobs on the "new" end
–  pop jobs from the "new" end
–  if a processor runs out of work, then "steal" from another's "old" end

Work Stealing

Tends to schedule "sequential blocks" of tasks
[Figure: the same queues, with tasks grouped into sequential blocks; marked entries indicate steals]

Each processor tends to execute a sequential part of the computation.
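A minimal sketch of the per-processor queue discipline described above (my own, mutex-based for clarity; real work-stealing schedulers such as Cilk's use carefully engineered lock-free deques):

#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// Per-processor work queue: the owner pushes and pops at the "new" end,
// while thieves steal from the "old" end.
class WorkQueue {
  std::deque<std::function<void()>> q;   // front = "old" end, back = "new" end
  std::mutex m;
public:
  void push(std::function<void()> task) {         // owner: push on "new" end
    std::lock_guard<std::mutex> g(m);
    q.push_back(std::move(task));
  }
  std::optional<std::function<void()>> pop() {    // owner: pop from "new" end
    std::lock_guard<std::mutex> g(m);
    if (q.empty()) return std::nullopt;
    auto t = std::move(q.back());
    q.pop_back();
    return t;
  }
  std::optional<std::function<void()>> steal() {  // thief: take from "old" end
    std::lock_guard<std::mutex> g(m);
    if (q.empty()) return std::nullopt;
    auto t = std::move(q.front());
    q.pop_front();
    return t;
  }
};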


Work Stealing Theory

For strict computations (Blumofe and Leiserson, 1999):
  # of steals = O(PD)
  Space = O(P·S1), where S1 is the sequential space

Acar, Blelloch and Blumofe, 2003:
  # of cache misses on distributed caches ≤ M1 + O(CPD)
  where M1 = sequential misses, C = cache size

Work Stealing Practice

Used in the Cilk scheduler
–  Small overheads because the common case of pushing/popping from the local queue can
   be made fast (with good data structures and compiler help).
–  No contention on a global queue
–  Has good distributed-cache behavior
–  Can indeed require O(S1·P) memory

Used in the X10 scheduler, and others


Parallel Depth First Schedules (P-DFS)

List scheduling based on depth-first ordering.

[Figure: a DAG on nodes 1–10 and its 2-processor schedule:
  1 / 2,6 / 3,4 / 5,7 / 8 / 9 / 10]

For strict computations a shared stack implements a P-DFS.

"Premature task" in P-DFS

A running task is premature if there is an earlier sequential task that is not complete.

[Figure: the same DAG and 2-processor schedule with the premature nodes marked]


P-DFS Theory

Blelloch, Gibbons, Matias, 1999. For any computation:
  Premature nodes at any time = O(PD)
  Space = S1 + O(PD)

Blelloch and Gibbons, 2004:
  With a shared cache of size C1 + O(PD) we have Mp = M1

P-DFS Practice

Experimentally uses less memory than work stealing and performs better on a shared cache.
Requires some "coarsening" to reduce overheads.


Conclusions
-  lots of parallel languages
-  lots of parallel programming styles
-  high-level ideas cross between languages and styles
-  scheduling is important


