Xeon Phi: Knights Landing
72 cores, x86, 500 GB/s of memory bandwidth
Summer 2015
64-core blade servers ($6K, shared memory)
[Figure: ×4 such blades]
Parallel Hardware
Many forms of parallelism:
– Supercomputers: large scale, shared memory
– Clusters and data centers: large scale, distributed memory
– Multicores: tightly coupled, smaller scale
– GPUs, on-chip vector units
– Instruction-level parallelism
Parallelism is important in the real world.
Parallel Thinking
Recognizing true dependences: unteach sequential programming.
Parallel algorithm-design techniques:
– Operations on aggregates: map/reduce/scan
– Divide & conquer, contraction
– Viewing computation as a DAG (based on dependences)
Cost model based on work and span

Quicksort from Aho-Hopcroft-Ullman (1974)

procedure QUICKSORT(S):
  if S contains at most one element then return S
  else
  begin
    choose an element a randomly from S;
    let S1, S2 and S3 be the sequences of elements in S
      less than, equal to, and greater than a, respectively;
    return (QUICKSORT(S1) followed by S2 followed by QUICKSORT(S3))
  end
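The two recursive calls touch disjoint data, so they can run in parallel, and the S1/S2/S3 split is itself an aggregate (filter) operation. As a rough sketch of this in C++ (my illustration, not the course's code; std::async stands in for a real scheduler and spawns far too many threads to be practical without a sequential cutoff):

  #include <cstddef>
  #include <future>
  #include <random>
  #include <vector>

  std::vector<int> quicksort(const std::vector<int>& S) {
      if (S.size() <= 1) return S;
      thread_local std::mt19937 gen{std::random_device{}()};
      std::uniform_int_distribution<std::size_t> pick(0, S.size() - 1);
      int a = S[pick(gen)];                        // random pivot
      std::vector<int> S1, S2, S3;                 // < a, = a, > a
      for (int x : S) (x < a ? S1 : x == a ? S2 : S3).push_back(x);
      auto left = std::async(std::launch::async,   // fork QUICKSORT(S1)
                             [&S1] { return quicksort(S1); });
      std::vector<int> right = quicksort(S3);      // QUICKSORT(S3) runs here
      std::vector<int> out = left.get();           // join
      out.insert(out.end(), S2.begin(), S2.end());
      out.insert(out.end(), right.begin(), right.end());
      return out;
  }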
Nested Parallelism
Nested parallelism = arbitrary nesting of parallel loops + fork-join.
– Assumes no synchronization among parallel tasks except at join points.
– Deterministic if no race conditions.
Advantages:
– Good schedulers are known
– Easy to understand, debug, and analyze costs
– Purely functional or imperative… either works

Serial-Parallel DAGs
Dependence graphs of nested parallel computations are series-parallel.
Two tasks are parallel if neither is reachable from the other.
A data race occurs if two parallel tasks access the same location and at least one access is a write.
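As a minimal, self-contained illustration of that definition (my example, not from the slides): the two threads below are parallel tasks that access the same location x, and one access is a write, so the program has a data race (undefined behavior in C++):

  #include <thread>

  int x = 0;

  int main() {
      std::thread writer([] { x = 1; });        // write to x
      std::thread reader([] { int y = x;        // unsynchronized read of x
                              (void)y; });
      writer.join();
      reader.join();
  }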
Nested Parallelism: fork-join
– Cilk, Cilk+:
    spawn S1;
    S2;
    sync;
– Various functional languages:
    (exp1 || exp2)
– Various dialects of ML and Lisp:
    plet
      x = exp1
      y = exp2
    in
      exp3

Cilk vs. what we've covered
Fork-join:
– ML:         val (a,b) = par(fn () => f(x), fn () => g(y))
– Pseudocode: val (a,b) = (f(x) || g(y))
– Cilk:       cilk_spawn a = f(x);
              b = g(y);
              cilk_sync;
Map:
– ML:         S = map f A
– Pseudocode: S = <f x : x in A>
– Cilk:       cilk_for (int i = 0; i < n; i++)
                S[i] = f(A[i])
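A hedged sketch of how this pattern maps onto a mainstream language (std::async stands in for a real fork-join runtime, and par is my name, not a standard function):

  #include <future>
  #include <utility>

  template <typename F, typename G>
  auto par(F f, G g) -> std::pair<decltype(f()), decltype(g())> {
      auto fa = std::async(std::launch::async, f);  // fork: f runs in parallel
      auto b  = g();                                // g runs on this thread
      return {fa.get(), b};                         // join: wait for f
  }

  // Usage, mirroring val (a,b) = par(fn () => f(x), fn () => g(y)):
  //   auto [a, b] = par([&]{ return f(x); }, [&]{ return g(y); });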
Example: Numerical Integration
Mathematically, we know that
  ∫₀¹ 4.0/(1+x²) dx = π
We can approximate the integral as a sum of rectangles:
  ∑_{i=0}^{N} F(xᵢ) Δx ≈ π
where each rectangle has width Δx and height F(xᵢ).
[Figure: plot of F(x) = 4.0/(1+x²) on [0,1], with values between 2.0 and 4.0]

OpenMP: The C code for Approximating PI
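The code itself did not survive extraction; the classic form of this example (a sketch of the standard midpoint-rule computation; compiles as C or C++ with -fopenmp) is:

  #include <stdio.h>

  static long num_steps = 100000000;

  int main(void) {
      double step = 1.0 / (double) num_steps;
      double sum = 0.0;
      /* Each iteration adds one rectangle: width `step`, height 4/(1+x^2)
         sampled at the midpoint of interval i. The reduction clause gives
         each thread a private copy of sum, combined at the end. */
      #pragma omp parallel for reduction(+:sum)
      for (long i = 0; i < num_steps; i++) {
          double x = (i + 0.5) * step;
          sum += 4.0 / (1.0 + x * x);
      }
      printf("pi ~= %.12f\n", step * sum);
      return 0;
  }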
Cost Model (General)
Compositional:
– Span: depth/critical path of the computation. The maximum span is taken across forked calls.
Parallelism = Work/Span
– Approximately the number of processors that can be used effectively.

Combining Costs (Nested Parallelism)
Combining for a parallel for:
  W(pfor …) = ∑_{i=0}^{n−1} W(f(i))    (work: sum across iterations)
  D(pfor …) = max_{i=0}^{n−1} D(f(i))   (span: max across iterations)
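A quick worked instance (mine, not from the slides): for pfor i in [0, n) where each f(i) does constant work, W = ∑_{i=0}^{n−1} O(1) = O(n) and D = max_{i=0}^{n−1} O(1) = O(1), giving parallelism W/D = O(n). Sequential composition instead adds both the works and the spans.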
20"
This is within a constant factor of optimal. 16"
Goals in designing an algorithm 12"
Sp pen e"
dt ng"F t"
De irst" st"
ia ay"T h"
."
g ."
se s"
x"A "
y"
N "
y
xV
Ne e"R ng
r
r
Su bod
rra
de "Tre
t"N te
te So
re
Sp hbo
Br ann d."S
un ear
"M
o
ria
O(n1/2) is probably good enough
es y"In
o
g
ei
ffi
a
ar
"
i
n
F
a
ica
l
h"
ng
" In
la
15-210 35 15-210
ar
36
pl
in
ea
Du
Tr
M
9
Styles of Parallel Programming
– Data parallelism/Bulk Synchronous/SPMD
– Nested parallelism: what we covered
– Message passing
– Futures (other pipelined parallelism)
– General Concurrency

Parallelism vs. Concurrency
– Parallelism: using multiple processors/cores running at the same time. Property of the machine.
– Concurrency: non-determinacy due to interleaving threads. Property of the application.

              sequential                  concurrent
  serial      Traditional programming     Traditional OS
  parallel    Deterministic parallelism   General parallelism
Concurrency: Stack Example 1

struct link {int v; link* next;};

struct stack {
  link* headPtr;
  void push(link* a) {
    a->next = headPtr;
    headPtr = a; }
  link* pop() {
    link* h = headPtr;
    if (headPtr != NULL)
      headPtr = headPtr->next;
    return h;}
};

This version races under concurrency: two concurrent pushes can both read the same headPtr before either writes it, so one of the pushed nodes is lost; a pop racing with a push can drop a node the same way.
Concurrency: Stack Example 2

struct stack {
  link* headPtr;
  void push(link* a) {
    link* h;
    do {
      h = headPtr;
      a->next = h;
    } while (!CAS(&headPtr, h, a)); }
  link* pop() {
    link *h, *nxt;
    do {
      h = headPtr;
      if (h == NULL) return NULL;
      nxt = h->next;
    } while (!CAS(&headPtr, h, nxt));
    return h;}
};
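The slide's CAS is pseudocode; a direct translation into standard C++11 atomics (my sketch) uses compare_exchange_weak, which also reloads the expected value with the current head when it fails:

  #include <atomic>

  struct link { int v; link* next; };

  struct stack {
      std::atomic<link*> headPtr{nullptr};

      void push(link* a) {
          link* h = headPtr.load();
          do {
              a->next = h;
          } while (!headPtr.compare_exchange_weak(h, a));
      }

      link* pop() {
          link* h = headPtr.load();
          // Reading h->next between the load and the CAS is exactly what
          // the ABA discussion on the next slide is about.
          while (h != nullptr && !headPtr.compare_exchange_weak(h, h->next)) {}
          return h;  // nullptr when the stack was empty
      }
  };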
Concurrency: Stack Example 2'

P1: x = s.pop(); y = s.pop(); s.push(x);
P2: z = s.pop();

Before: A B C
After: B C

Consider the interleaving:
P2: h = headPtr;
P2: nxt = h->next;
P1: everything
P2: CAS(&headPtr, h, nxt)

P2 reads h = A and nxt = B; P1 then pops A and B and pushes A back, leaving A C. When P2's CAS finally runs, headPtr points to A again, so the CAS succeeds and sets headPtr to the stale nxt = B, even though B was already popped by P1.
The ABA problem.
Can be fixed with a counter and 2CAS, but…

Concurrency: Stack Example 3

struct link {int v; link* next;};

struct stack {
  link* headPtr;
  void push(link* a) {
    atomic {
      a->next = headPtr;
      headPtr = a; }}
  link* pop() {
    atomic {
      link* h = headPtr;
      if (headPtr != NULL)
        headPtr = headPtr->next;
      return h;}}
};
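The atomic { … } blocks are not standard C++ (they assume an atomic-block or transactional-memory extension); as a sketch of the same semantics with ordinary locks (my translation):

  #include <mutex>

  struct link { int v; link* next; };

  struct stack {
      link* headPtr = nullptr;
      std::mutex m;

      void push(link* a) {
          std::lock_guard<std::mutex> g(m);  // whole update is one critical section
          a->next = headPtr;
          headPtr = a;
      }

      link* pop() {
          std::lock_guard<std::mutex> g(m);
          link* h = headPtr;
          if (headPtr != nullptr) headPtr = headPtr->next;
          return h;
      }
  };

Unlike the CAS version this can block, but it avoids the ABA problem because the entire read-modify-write is one critical section.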
Futures: Example

Without futures:

fun quickSort S = let
  fun qs([], rest) = rest
    | qs(a::T, rest) =
        let
          val L1 = filter (fn b => b < a) T
          val L2 = filter (fn b => b >= a) T
        in
          qs(L1, a::(qs(L2, rest)))
        end
in
  qs(S, [])
end

Span = O(n)

With futures, the only change is to wrap the second recursive call:

          qs(L1, future(a::(qs(L2, rest))))

Span = O(lg² n)

Intuitively, the future makes the head of the result list available before qs(L2, rest) has finished, so the recursion is pipelined instead of waiting for the whole suffix to be sorted.
Styles of Parallel Programming
– Data parallelism/Bulk Synchronous/SPMD
– * Nested parallelism: what we covered
– Message passing
– * Futures (other pipelined parallelism)
– * General Concurrency

Xeon 7500
– Memory: up to 1 TB, shared by 4 sockets
– L3: 24 MB, shared by the 8 cores of a socket
– L2: 128 KB per core
– L1: 32 KB per core
Question
How do we get nested parallel programs to work well on a fixed set of processors? Programs say nothing about processors.
Answer: good schedulers

Greedy Schedules
"Speedup versus Efficiency in Parallel Systems", Eager, Zahorjan and Lazowska, 1989
For any greedy schedule:
  Efficiency = W/(P·T_P) ≥ W/(W + D·(P − 1))
  Parallel Time = T_P ≤ W/P + D
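A concrete, hypothetical instance: with W = 10⁸, D = 10⁴ and P = 100, any greedy schedule gives T_P ≤ 10⁸/100 + 10⁴ = 1.01 × 10⁶ and Efficiency ≥ 10⁸/(10⁸ + 10⁴·99) ≈ 0.99. Since every schedule needs T_P ≥ max(W/P, D), the greedy bound is within a factor of 2 of optimal.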
Types of Schedulers
– Breadth-first schedulers
– Work-stealing schedulers
– Depth-first schedulers

Breadth-First Schedules
The most naïve schedule. Used by most implementations of P-threads.
[Figure: example computation with O(n³) tasks]
Work Stealing Theory
For strict computations (Blumofe and Leiserson, 1999):
– # of steals = O(PD)
– Space = O(P·S1), where S1 is the sequential space
Acar, Blelloch and Blumofe, 2003:
– # of cache misses on distributed caches = M1 + O(CPD), where M1 = sequential misses and C = cache size

Work Stealing Practice
Used in the Cilk scheduler:
– Small overheads, because the common case of pushing/popping from the local queue can be made fast (with good data structures and compiler help)
– No contention on a global queue
– Has good distributed-cache behavior
– Can indeed require O(S1·P) memory
Also used in the X10 scheduler, and others.
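To make the "local queue fast path, steal from a random victim" structure concrete, here is a toy work-stealing loop (entirely my sketch: mutex-protected deques and busy-waiting stand in for the lock-free deques and subtle protocols real schedulers like Cilk's use):

  #include <atomic>
  #include <cstdio>
  #include <cstdlib>
  #include <deque>
  #include <functional>
  #include <mutex>
  #include <thread>
  #include <vector>

  using Task = std::function<void()>;

  struct Worker { std::deque<Task> dq; std::mutex m; };

  std::vector<Worker> workers(4);
  std::atomic<int> pending{0};

  void push(int id, Task t) {
      pending++;
      std::lock_guard<std::mutex> g(workers[id].m);
      workers[id].dq.push_back(std::move(t));       // bottom = newest
  }

  bool pop_or_steal(int id, Task& out) {
      {   // fast path: newest task from our own bottom
          std::lock_guard<std::mutex> g(workers[id].m);
          if (!workers[id].dq.empty()) {
              out = std::move(workers[id].dq.back());
              workers[id].dq.pop_back();
              return true;
          }
      }
      // slow path: oldest task from a random victim's top
      int v = std::rand() % (int)workers.size();
      std::lock_guard<std::mutex> g(workers[v].m);
      if (workers[v].dq.empty()) return false;
      out = std::move(workers[v].dq.front());
      workers[v].dq.pop_front();
      return true;
  }

  int main() {
      for (int i = 0; i < 100; i++)
          push(0, [i] { std::printf("task %d\n", i); pending--; });
      std::vector<std::thread> threads;
      for (int id = 0; id < 4; id++)
          threads.emplace_back([id] {
              Task t;
              while (pending.load() > 0)            // busy-wait: toy only
                  if (pop_or_steal(id, t)) t();
          });
      for (auto& th : threads) th.join();
  }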
P-DFS Theory
Blelloch, Gibbons, Matias, 1999. For any computation:
– Premature nodes at any time = O(PD)
– Space = S1 + O(PD)
Blelloch and Gibbons, 2004:
– With a shared cache of size C1 + O(PD), we have MP = M1.

P-DFS Practice
Experimentally, it uses less memory than work stealing and performs better on a shared cache.
Requires some "coarsening" to reduce overheads.
Conclusions
– lots of parallel languages
– lots of parallel programming styles
– high-level ideas cross between languages and styles
– scheduling is important