• MPI
– Message Passing Interface
– API for distributed memory parallel computing (multiple processes)
– The dominant model used in cluster computing
• OpenMP
– Open Multi-Processing
– API for shared memory parallel computing (multiple threads)
• Parallel Matlab
– A popular high-level technical computing language and interactive environment
Learning Resources
• Books
– http://www.mcs.anl.gov/~itf/dbpp/
– https://computing.llnl.gov/tutorials/parallel_comp/
– http://www-users.cs.umn.edu/~karypis/parbook/
• Journals
– http://www.computer.org/tpds
– http://www.journals.elsevier.com/parallel-computing/
– http://www.journals.elsevier.com/journal-of-parallel-and-distributed-computing/
• CUDA
– http://developer.nvidia.com
Half Adder
Inputs: A (augend), B (addend)
Outputs: S (sum), C (carry)
Full Adder
SR Latch
S R | Q(next)
0 0 | Q (hold)
0 1 | 0 (reset)
1 0 | 1 (set)
1 1 | N/A
Address Decoder
Electronic Numerical Integrator And Computer
• Programming
– Programmable via switches and cables
– Reprogramming usually took days
– I/O: Punched Cards
Ada Lovelace
First PhD in Computer Science
Personal Computer in 1980s
Top 500 Supercomputers
(Performance measured in GFLOPS)
Cost of Computing
Approximate cost per GFLOPS by date, inflation-adjusted to 2013 dollars
Parallel Sum
Each of 8 cores sums a block of three values:
Data:  (1,4,3) (9,2,8) (5,1,1) (6,2,7) (2,5,0) (4,1,8) (6,5,1) (2,3,9)
Cores:    0       1       2       3       4       5       6       7
Sums:     8      19       7      15       7      13      12      14
Core 0 then collects all the partial sums: total = 95.
Parallel Sum
With tree reduction, partial sums are combined pairwise in log2(8) = 3 steps:
Cores:    0    1    2    3    4    5    6    7
Step 0:   8   19    7   15    7   13   12   14
Step 1:  27        22        20        26
Step 2:  49                  46
Step 3:  95
Prefix Scan
Original Vector: 3 5 2 5 7 9 4 6
prefixScan[0] = A[0];
for (i = 1; i < N; i++)
    prefixScan[i] = prefixScan[i-1] + A[i];
Parallel Prefix Scan
Input (16 elements, 4 blocks of 4):
3 5 2 5 | 7 9 -4 6 | 7 -3 1 7 | 6 8 -1 2
Step 1 – local prefix scan of each block:
3 8 10 15 | 7 16 12 18 | 7 4 5 12 | 6 14 13 15
Step 2 – scan the block totals (15, 18, 12, 15) to get block offsets 0, 15, 33, 45.
Step 3 – add each block's offset to its local results:
3 8 10 15 | 22 31 27 33 | 40 37 38 45 | 51 59 58 60
Levels of Parallelism
• Embarrassingly Parallel
– No dependency or communication between parallel tasks
• Coarse-Grained Parallelism
– Infrequent communication, large amounts of computation
• Fine-Grained Parallelism
– Frequent communication, small amounts of computation
– Greater potential for parallelism
– More overhead
• Not Parallel
– Bearing a baby takes 9 months.
– Can this be done in 1 month by having 9 women?
Types of Parallelism
• Task Parallelism
– Different tasks on the same/different sets of data
• Data Parallelism
– Similar tasks on different sets of the data
• Example
– 5 TAs, 100 exam papers, 5 questions
– How to make it task parallel?
– How to make it data parallel?
Data Decomposition
2 Cores
Granularity
8 Cores
Coordination
• Communication
– Sending partial results to other cores
• Load Balancing
– Wooden Barrel Principle
• Synchronization
– Race Condition
Time  Thread A                    Thread B
1     Read variable V             Read variable V
2     Add 1 to variable V         Add 1 to variable V
3     Write back to variable V    Write back to variable V
Data Dependency
• Bernstein's Conditions (tasks i and j can run in parallel iff all three intersections are empty)
– I(j) ∩ O(i) = ∅   (otherwise: Flow Dependency)
– I(i) ∩ O(j) = ∅   (otherwise: Anti-Dependency)
– O(i) ∩ O(j) = ∅   (otherwise: Output Dependency)
• Examples
1: function NoDep(a,
1: function Dep(a, b)
b)
2: c = a·b
2: c = a·b
3: d = 3·c
3: d = 3·b
4: end function
4: e = a+b
5: end function
28
What is not parallel?
Recurrences
Loop-Carried Dependence
What is not parallel?
With induction variables, each iteration depends on the previous one:
i1=4; i2=0;
for (k=1; k<N; k++) {
    B[i1++]=function1(k,q,r);
    i2+=k;
    A[i2]=function2(k,r,q);
}
Rewritten with closed forms, every iteration becomes independent:
for (k=1; k<N; k++) {
    B[k+3]=function1(k,q,r);
    i2=(k*k+k)/2;
    A[i2]=function2(k,r,q);
}
Assembly Line
Pipeline stage times: 15, 20, 5
• How long is the gap between producing the first and the second car?
Instruction 2 depends on the result of instruction 1:
1: Add 1 to R5.
2: Copy R5 to R6.
Computing Models
• Concurrent Computing
– Multiple tasks can be in progress at any instant.
• Parallel Computing
– Multiple tasks can be run simultaneously.
• Distributed Computing
– Multiple programs on networked computers work collaboratively.
• Cluster Computing
– Homogeneous, Dedicated, Centralized
• Grid Computing
– Heterogeneous, Loosely Coupled, Autonomous, Geographically Distributed
Concurrent vs. Parallel
(Diagram: concurrent execution interleaves Jobs 1 and 2 on a single core; parallel execution runs them simultaneously on Cores 1 and 2; four jobs across two cores are both concurrent and parallel.)
Process & Thread
• Process
– An instance of a computer program being executed
• Threads
– The smallest units of processing scheduled by the OS
– Exist as subsets of a process
– Share the resources of their process
– Switching between threads is much faster than switching between processes
• Multithreading
– Better use of computing resources
– Concurrent execution
– Makes the application more responsive
Parallel Processes
(Diagram: one process per node, e.g. Process 1 on Node 1 and Process 3 on Node 3, communicating across nodes.)
Graphics Processing Unit
CPU vs. GPU
CUDA
GPU Computing Showcase
MapReduce vs. GPU
• Pros:
– Run on clusters of hundreds or thousands of commodity computers.
– Can handle massive amounts of data with fault tolerance.
– Minimal effort required from programmers: just Map & Reduce.
• Cons:
– Intermediate results are stored in disks and transferred via network links.
– Suitable for processing independent or loosely coupled jobs.
– High upfront hardware cost and operational cost
– Low Efficiency: GFLOPS per Watt, GFLOPS per Dollar
Parallel Computing in Matlab
for i=1:1024
    A(i) = sin(i*2*pi/1024);
end
plot(A);

matlabpool open
parfor i=1:1024
    A(i) = sin(i*2*pi/1024);
end
plot(A);
matlabpool close
GPU Computing in Matlab
http://www.mathworks.cn/discovery/matlab-gpu.html
Cloud Computing
Everything is Cloud …
Five Attributes of Cloud Computing
• Service Based
– What the service needs to do is more important than how the technologies are used to
implement the solution.
• Shared
– Services share a pool of resources to build economies of scale.
• Metered by Use
– Services are tracked with usage metrics to enable multiple payment models.
Von Neumann Architecture
Harvard Architecture
Inside a PC ...
Front-Side Bus (Core 2 Extreme)
Memory (DDR3-1600)
Shared Memory System
(Diagram: multiple cores connected through an interconnect to a shared memory.)
Non-Uniform Memory Access
(Diagram: two nodes, each with Cores 1 and 2, an interconnect, and local memory; a core reaches the other node's memory via slower remote access.)
Distributed Memory System
(Diagram: independent nodes, each with its own private memory, connected by a communication network.)
Crossbar Switch
(Diagram: a 4×4 crossbar connecting processors P1–P4 to memory modules M1–M4, with a switch at each crossing so any processor can reach any module.)
Cache
• Component that transparently stores data so that future requests for that data
can be served faster
– Compared to main memory: smaller, faster, more expensive
– Spatial Locality
– Temporal Locality
• Cache Line
– A block of data that is accessed together
• Cache Miss
– A failed attempt to read or write a piece of data in the cache
– Requires an access to main memory
– Read Miss, Write Miss
– Compulsory Miss, Capacity Miss, Conflict Miss
Writing Policies
Cache Mapping
(Figure: how blocks of memory map to cache lines; the 4×4 matrix A, elements (0,0) through (3,3), shown in memory order, with a column-major layout for contrast.)
#define MAX 4
double A[MAX][MAX], x[MAX], y[MAX];
/* Initialize A and x, assign y=0 */
for (i=0; i<MAX; i++)
    for (j=0; j<MAX; j++)
        y[i] += A[i][j]*x[j];
Cache Coherence
Time 0: Core 0 runs y0=x;  Core 1 runs y1=3*x;
Time 1: Core 0 runs x=7;   Core 1 runs statements not involving x.
Time 2: Core 0 runs statements not involving x;  Core 1 runs z1=4*x;
Each core has x in its own cache (Cache 0, Cache 1), connected by an interconnect.
What is the value of z1?
(Figure: with a write-invalidate protocol, Core 0's update of a shared value invalidates Core 1's cached copy, forcing a reload; with write-update, the new value is pushed to the other cache.)
False Sharing
int i, j, m, n; /* private variables */
double y[m];
(Figure: threads updating distinct elements of y that happen to share a cache line repeatedly invalidate each other's copies.)
Virtual Memory
• Page
– A block of contiguous virtual memory addresses
– The smallest unit to be swapped in/out of main memory from/into secondary storage
• Page Table
– Stores the mapping between virtual addresses and physical addresses
• Page Fault
– Raised when the accessed page is not in physical memory
Interleaving Statements
(Figure: the possible interleavings of threads T0 and T1, each executing statements s1 and s2, preserving program order within each thread.)
Number of ways to interleave two threads with M and N statements:
C(M+N, M) = (M+N)! / (M! · N!)
Critical Region
• A portion of code where shared resources are accessed and updated
• Threads are disallowed from entering the critical region when another thread is
occupying the critical region.
• If a thread is not executing within the critical region, that thread must not prevent
another thread seeking entry from entering the region.
First Attempt
int threadNumber = 0;
• Q1: Can T1 enter the critical region more times than T0?
• Q2: What would happen if T0 terminates (by design or by accident)?
Second Attempt
int Thread0inside = 0;
int Thread1inside = 0;
• Q1: Can T1 enter the critical region multiple times when T0 is not within the critical region?
• Q2: Can T0 and T1 be allowed to enter the critical region at the same time?
Third Attempt
int Thread0WantsToEnter = 0;
int Thread1WantsToEnter = 0;
Fourth Attempt
int Thread0WantsToEnter = 0;
int Thread1WantsToEnter = 0;
Foster's Design Methodology
• Partitioning
– Divide the computation to be performed and the data operated on by the computation into small tasks.
• Communication
– Determine what communication needs to be carried out among the tasks.
• Agglomeration
– Combine tasks that communicate intensively with each other or must be executed sequentially into larger tasks.
• Mapping
– Assign the composite tasks to processes/threads to minimize inter-processor communication and maximize processor utilization.
Parallel Histogram
(Figure: data values tallied into histogram bins 0–5.)
Parallel Histogram
(Figure: threads bin data[i-1], data[i], data[i+1], data[i+2] in parallel; each thread increments its private counters loc_bin_cts[b-1] and loc_bin_cts[b], and the local counts are then merged into the global bin_counts.)
Performance
• Speedup
S = T_serial / T_parallel
• Efficiency
E = S / N = T_serial / (N · T_parallel)
• Scalability
– Problem Size, Number of Processors
• Strongly Scalable
– Same efficiency for larger N with a fixed total problem size
• Weakly Scalable
– Same efficiency for larger N with a fixed problem size per processor
Amdahl's Law
S(N) = 1 / ((1 - P) + P/N)
where P is the fraction of the program that can be parallelized and N is the number of processors.
Gustafson's Law
T_parallel = a + b        T_serial = a + N·b
(a: time of the sequential part, b: time of the parallel part on N processors)
S(N) = (a + N·b) / (a + b) = N - (N - 1)·α,  for α = a / (a + b)
• Linear speedup can be achieved when:
– The problem size is allowed to grow monotonically with N.
– The sequential part is fixed or grows slowly.
Review
• Name the four categories of memory systems.
Review
• Name three major APIs for parallel computing.