Professional Documents
Culture Documents
Tradeoffs
(plus Msg Passing Finish)
CS 258, Spring 99
David E. Culler
Computer Science Division
U.C. Berkeley
SAS Recap
Processes
P1
Sweep
P2
P4
Test Convergence
P0
P1
P1
P2
P4
P2
/*communicate local diff values and determine if done, using reduction and broadcast*/
25b. REDUCE(0,mydiff,sizeof(float),ADD);
25c. if (pid == 0) then
25i. if (mydiff/(n*n) < TOL) then done = 1;
25k. endif
25m. BROADCAST(0,done,sizeof(int),DONE);
2/3/99 CS258 S99.5 6
Send and Receive Alternatives
• extended functionality: stride, scatter-gather, groups
• Sychronization semantics
– Affect when data structures or buffers can be reused at either end
– Affect event synch (mutual excl. by fiat: only one process touches data)
– Affect ease of programming and performance
• Synchronous messages provide built-in synch. through match
– Separate event synchronization may be needed with asynch. messages
• With synch. messages, our code may hang. Fix?
Send/Receive
Synchronous Asynchronous
Sequential Work
Speedup <
Max (Work + Synch Wait Time + Comm Cost + Extra Work)
P1 P1
P2 P2
P3 P3
– dynamic assignment
P1
• Schedule more carefully
– avoid serialization P2
– estimate work
P4
– use history info.
for_all i = 1 to n do
for_all j = i to n do
A[ i, j ] = A[i-1, j] + A[i, j-1] + ...
QQ 0 Q1 Q2 Q3
Others may
steal
All remove tasks P0 removes P1 removes P2 removes P3 removes
(a) Centralized task queue (b) Distributed task queues (one per pr ocess)
2/3/99 CS258 S99.5 15
Impact of Dynamic Assignment
• Barnes-Hut on SGI Origin 2000 (cache-coherent shared memory):
Speedup
15 15
10 10
5 5
(a) 0 (b) 0
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
Number of processors Number of processors
2/3/99 CS258 S99.5 16
Self-Scheduling
volatile int row_index = 0; /* shared index variable */
Sequential Work
Speedup <
Max (Work + Synch Wait Time + Comm Cost + Extra Work)
2/3/99 CS258 S99.5 21
Reducing Inherent Communication
Sequential Work
Speedup <
Max (Work + Synch Wait Time + Comm Cost)
• Communication is expensive!
• Measure: communication to computation ratio
• Inherent communication
– Determined by assignment of tasks to processes
– One produces data consumed by others
=> Use algorithms that communicate less
=> Assign tasks that access same data to same
process
– same row or block to same process in each iteration
P4 P5 P6 P7
n n
p
P8 P9 P10 P11
P0 P1 P2 P3
P4 P5 P6 P7
n
-----
- n
p P8 P9 P10 P11
4*√p
• Comm to comp: n for block, 2*p for strip
• Application dependent: strip may nbe better in other cases
– E.g. particle flow in tunnel
12 12 12 12
3 4 3 4 3 4 3 4
12
12 12 12 12
3 4 3 4 3 4 3 4
12 12 12 12
3 4 3 4 3 4 3 4
3 4
12 12 12 12
3 4 3 4 3 4 3 4
Sequential Work
Speedup <
Max (Work + Synch Wait Time + Comm Cost)
Flat
Tree structured
•Module: all-to-all personalized comm. in matrix transpose
–solution:
stagger access by different processors to same
node temporally
•In general, reduce burstiness; may conflict with making
messages larger
2/3/99 CS258 S99.5 31
Overlapping Communication
• Cannot afford to stall for high latencies
– even on uniprocessors!
• Overlap with computation or communication to
hide latency
• Requires extra concurrency (slackness), higher
bandwidth
• Techniques:
– Prefetching
– Block data transfer
– Proceeding past communication
– Multithreading
7
1.00E+06
6
FT FT
5 IS IS
1.00E+05
LU LU
4
MG MG
3 1.00E+04
SP SP
BT BT
2
1.00E+03
1
0 1.00E+02
0 10 20 30 40 0 10 20 30 40
1.2 9.00E+09
8.00E+09
1
7.00E+09
FT FT
0.8 6.00E+09
IS IS
LU 5.00E+09 LU
0.6
MG 4.00E+09 MG
SP SP
0.4 3.00E+09
BT BT
2.00E+09
0.2
1.00E+09
0 0.00E+00
0 10 20 30 40 0 10 20 30 40
Data traffic
First working set
Capacity-generated traffic
(including conflicts)
Second working set
Inherent communication
(a) Unblocked access pattern in a sweep (b) Blocked access pattern with B = 4
P0 P1 P2 P3 P0 P1 P2 P3
P4 P5 P6 P7 P4 P5 P6 P7
P8 P8
35
Rows-rr
20 2D
30 2D-rr
Speedup
Speedup
15 25
20
10
15
10
5
5
0 0
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
Number of processors Number of processors
8-fold reduction
in miss rate from
4 to 8 proc
3000
2500
2000
Wait
Total Time
Receive
1500
Send
Compute
1000
500
0
4 8 16 32
Processors