
Glossary

• MPI
– Message Passing Interface
– API for distributed memory parallel computing (multiple processes)
– The dominant model used in cluster computing

• OpenMP
– Open Multi-Processing
– API for shared memory parallel computing (multiple threads)

• GPU Computing with CUDA


– Graphics Processing Unit
– Compute Unified Device Architecture
– API for shared memory parallel computing in C (multiple threads)

• Parallel Matlab
– A popular high-level technical computing language and interactive environment
Learning Resources
• Books
– http://www.mcs.anl.gov/~itf/dbpp/
– https://computing.llnl.gov/tutorials/parallel_comp/
– http://www-users.cs.umn.edu/~karypis/parbook/

• Journals
– http://www.computer.org/tpds
– http://www.journals.elsevier.com/parallel-computing/
– http://www.journals.elsevier.com/journal-of-parallel-and-distributed-computing/

• Amazon Cloud Computing Services


– http://aws.amazon.com

• CUDA
– http://developer.nvidia.com

Half Adder

Inputs:  A (augend), B (addend)
Outputs: S (sum), C (carry)
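As a minimal sketch (not from the slides; the gate-level figure is omitted here), the half adder computes the sum as the XOR of the inputs and the carry as the AND:

/* Half adder: single-bit inputs a and b produce a sum bit and a carry bit. */
void half_adder(int a, int b, int *s, int *c)
{
    *s = a ^ b;   /* S = A XOR B */
    *c = a & b;   /* C = A AND B */
}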

Full Adder
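The circuit figure is omitted; as a hedged sketch, a full adder extends the half adder with a carry-in, chaining two half adders and OR-ing their carries:

/* Full adder: input bits a, b and carry-in cin produce a sum bit and a carry-out. */
void full_adder(int a, int b, int cin, int *s, int *cout)
{
    *s    = a ^ b ^ cin;               /* sum of the three input bits (mod 2) */
    *cout = (a & b) | (cin & (a ^ b)); /* carry out of either half adder */
}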

SR Latch

S  R  Q (next state)
0  0  Q (hold)
0  1  0 (reset)
1  0  1 (set)
1  1  N/A (not allowed)
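A small C sketch of the NOR-based latch behaviour (an assumption about the omitted figure; a real latch settles through continuous feedback, approximated here by iterating twice):

/* SR latch from two cross-coupled NOR gates; Q and Qn hold the previous state. */
void sr_latch(int S, int R, int *Q, int *Qn)
{
    for (int k = 0; k < 2; k++) {     /* iterate the feedback until it settles */
        *Q  = !(R | *Qn);             /* Q  = NOR(R, Qn) */
        *Qn = !(S | *Q);              /* Qn = NOR(S, Q)  */
    }
}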

Address Decoder

Electronic Numerical Integrator And Computer

• Programming
– Programmable
– Switches and Cables
– Usually took days.
– I/O: Punched Cards

• Speed (10-digit decimal numbers)
– Machine Cycle: 5,000 cycles per second
– Multiplication: 357 times per second
– Division/Square Root: 35 times per second
Pioneer Programmer

Ada Lovelace

First PhD in Computer Science

Mary Kenneth Keller


Stored-Program Computer

Personal Computers in the 1980s

BASIC IBM PC/AT

Top 500 Supercomputers
[Figure: performance development over time, measured in GFLOPS]
Cost of Computing
Date            Approx. cost per GFLOPS    Inflation adjusted to 2013 dollars
1984            $15,000,000                $33,000,000
1997            $30,000                    $42,000
April 2000      $1,000                     $1,300
May 2000        $640                       $836
August 2003     $82                        $100
August 2007     $48                        $52
March 2011      $1.80                      $1.80
August 2012     $0.75                      $0.73
December 2013   $0.12                      $0.12
January 2015    $0.08                      $0.08
Complexity of Computing

• A: 10×100   B: 100×5   C: 5×50

• (AB)C vs. A(BC)

• A: N×N   B: N×N   C = AB

• Time Complexity: O(N³)

• Space Complexity: O(1) auxiliary (beyond the output matrix)
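As a quick worked check of why the parenthesization matters (multiplying a p×q matrix by a q×r matrix takes roughly p·q·r scalar multiplications, with the dimensions given above):

(AB)C: 10·100·5 + 10·5·50 = 5,000 + 2,500 = 7,500 multiplications
A(BC): 100·5·50 + 10·100·50 = 25,000 + 50,000 = 75,000 multiplications

Both orders give the same matrix, but (AB)C is about ten times cheaper.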


Why Parallel Computing?

• Why we need ever-increasing performance:


– Big Data Analysis
– Climate Modeling
– Gaming

• Why we need to build parallel systems:


– Increasing the clock speed of integrated circuits → overheating
– Increasing the number of transistors → multi-core processors

• Why we need to learn parallel programming:


– Running multiple instances of the same program is unlikely to help.
– Need to rewrite serial programs to make them parallel.

Parallel Sum
Data (8 blocks of 3 values): 1,4,3 | 9,2,8 | 5,1,1 | 6,2,7 | 2,5,0 | 4,1,8 | 6,5,1 | 2,3,9

Local sums on cores 0–7: 8, 19, 7, 15, 7, 13, 12, 14

Core 0 adds all the partial sums: 95
Parallel Sum
Data (8 blocks of 3 values): 1,4,3 | 9,2,8 | 5,1,1 | 6,2,7 | 2,5,0 | 4,1,8 | 6,5,1 | 2,3,9

Local sums on cores 0–7: 8, 19, 7, 15, 7, 13, 12, 14

Tree reduction, step 1 (cores 0, 2, 4, 6): 27, 22, 20, 26
Tree reduction, step 2 (cores 0, 4): 49, 46
Tree reduction, step 3 (core 0): 95
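A minimal OpenMP sketch of the same reduction (not from the slides): every thread sums part of the array and the runtime combines the partial sums, mirroring the tree above.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data[24] = {1,4,3, 9,2,8, 5,1,1, 6,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9};
    int sum = 0;

    /* Each thread sums a chunk; reduction(+) merges the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 24; i++)
        sum += data[i];

    printf("sum = %d\n", sum);   /* prints 95 */
    return 0;
}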
Prefix Scan

Original Vector 3 5 2 5 7 9 4 6

Inclusive Prefix Scan 3 8 10 15 22 31 35 41

Exclusive Prefix Scan 0 3 8 10 15 22 31 35

prefixScan[0] = A[0];
for (i=1; i<N; i++)
    prefixScan[i] = prefixScan[i-1] + A[i];

Parallel Prefix Scan
Input (4 blocks of 4): 3 5 2 5 | 7 9 -4 6 | 7 -3 1 7 | 6 8 -1 2

Step 1 – inclusive scan of each block:  3 8 10 15 | 7 16 12 18 | 7 4 5 12 | 6 14 13 15
Step 2 – block sums:                    15 18 12 15
Step 3 – exclusive scan of block sums:  0 15 33 45
Step 4 – add each block's offset:       3 8 10 15 | 22 31 27 33 | 40 37 38 45 | 51 59 58 60
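A hedged OpenMP sketch of these four steps (assuming the array length is a multiple of the thread count and at most 256 threads):

#include <omp.h>

void parallel_scan(const int *a, int *scan, int n)
{
    int block_sums[256];                       /* shared between threads */

    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int t = omp_get_thread_num();
        int block = n / nthreads;
        int lo = t * block, hi = lo + block;

        /* Step 1: inclusive scan of this thread's block. */
        scan[lo] = a[lo];
        for (int i = lo + 1; i < hi; i++)
            scan[i] = scan[i-1] + a[i];
        block_sums[t] = scan[hi-1];            /* Step 2: block sum */

        #pragma omp barrier
        #pragma omp single
        {
            /* Step 3: exclusive scan of the block sums (cheap, done serially). */
            int running = 0;
            for (int b = 0; b < nthreads; b++) {
                int s = block_sums[b];
                block_sums[b] = running;
                running += s;
            }
        }                                      /* implicit barrier after single */

        /* Step 4: add this block's offset to every element in the block. */
        for (int i = lo; i < hi; i++)
            scan[i] += block_sums[t];
    }
}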
Levels of Parallelism
• Embarrassingly Parallel
– No dependency or communication between parallel tasks

• Coarse-Grained Parallelism
– Infrequent communication, large amounts of computation

• Fine-Grained Parallelism
– Frequent communication, small amounts of computation
– Greater potential for parallelism
– More overhead

• Not Parallel
– Bearing a child takes nine months.
– Can this be done in one month by using nine women?
Types of Parallelism

• Instruction Level Parallelism

• Task Parallelism
– Different tasks on the same/different sets of data

• Data Parallelism
– Similar tasks on different sets of the data

• Example
– 5 TAs, 100 exam papers, 5 questions
– How to make it task parallel?
– How to make it data parallel?

Data Decomposition

2 Cores

Granularity

8 Cores

Coordination
• Communication
– Sending partial results to other cores

• Load Balancing
– Wooden Barrel Principle

• Synchronization
– Race Condition

Thread A                        Thread B
1A: Read variable V             1B: Read variable V
2A: Add 1 to variable V         2B: Add 1 to variable V
3A: Write back to variable V    3B: Write back to variable V
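A minimal sketch of this race in C with POSIX threads (not from the slides): without the lock, increments from the two threads are lost and the final value is usually below 2,000,000; holding the lock makes the read-add-write sequence atomic.

#include <pthread.h>
#include <stdio.h>

static long v = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* comment out the lock to observe the race */
        v = v + 1;                    /* read V, add 1, write back */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("v = %ld\n", v);           /* 2000000 when the lock is used */
    return 0;
}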

Data Dependency
• Bernstein's Conditions
– Two code segments i and j (with input sets I and output sets O) can run in parallel only if all three intersections below are empty.

I_j ∩ O_i ≠ ∅ → Flow Dependency
I_i ∩ O_j ≠ ∅ → Anti Dependency
O_i ∩ O_j ≠ ∅ → Output Dependency

• Examples

1: function NoDep(a, b)
2:   c = a·b
3:   d = 3·b
4:   e = a+b
5: end function

1: function Dep(a, b)
2:   c = a·b
3:   d = 3·c
4: end function
What is not parallel?
Recurrences

for (i=1; i<N; i++)
    a[i] = a[i-1] + b[i];

Loop-Carried Dependence

for (k=5; k<N; k++) {
    b[k] = DoSomething(k);
    a[k] = b[k-5] + MoreStuff(k);
}
What is not parallel?

Induction Variables:

i1=4;
i2=0;
for (k=1; k<N; k++) {
    B[i1++] = function1(k,q,r);
    i2 += k;
    A[i2] = function2(k,r,q);
}

Solution (replace the loop-carried updates of i1 and i2 with closed-form expressions):

i1=4;
i2=0;
for (k=1; k<N; k++) {
    B[k+3] = function1(k,q,r);
    i2 = (k*k+k)/2;
    A[i2] = function2(k,r,q);
}
Assembly Line

[Figure: three-stage assembly line with stage times 15, 20, and 5]

• How long does it take to produce a single car?

• How many cars can be worked on at the same time?

• How long is the gap between finishing the first car and the second?

• The longest stage on the assembly line determines the throughput.
Instruction Pipeline

1: Add 1 to R5.

2: Copy R5 to R6.

IF: Instruction fetch


ID: Instruction decode and register fetch
EX: Execute
MEM: Memory access
WB: Register write back
Superscalar

Computing Models
• Concurrent Computing
– Multiple tasks can be in progress at any instant.

• Parallel Computing
– Multiple tasks can be run simultaneously.

• Distributed Computing
– Multiple programs on networked computers work collaboratively.

• Cluster Computing
– Homogenous, Dedicated, Centralized

• Grid Computing
– Heterogeneous, Loosely Coupled, Autonomous, Geographically Distributed
Concurrent vs. Parallel
[Figure: concurrent execution time-slices Jobs 1 and 2 on the available cores; parallel execution runs the jobs simultaneously on different cores (Core 1, Core 2)]
Process & Thread
• Process
– An instance of a computer program being executed

• Threads
– The smallest units of processing scheduled by OS
– Exist as a subset of a process.
– Share the same resources from the process.
– Switching between threads is much faster than switching between processes.

• Multithreading
– Better use of computing resources
– Concurrent execution
– Makes the application more responsive

[Figure: a process containing multiple threads]
Parallel Processes

[Figure: one program launched as Process 1 on Node 1, Process 2 on Node 2, and Process 3 on Node 3]

Single Program, Multiple Data
Parallel Threads

Graphics Processing Unit

CPU vs. GPU

CUDA

GPU Computing Showcase

MapReduce vs. GPU
• Pros (of MapReduce):
– Runs on clusters of hundreds or thousands of commodity computers.
– Can handle massive amounts of data with fault tolerance.
– Minimal effort required from programmers: Map & Reduce

• Cons:
– Intermediate results are stored on disk and transferred over network links.
– Only suitable for processing independent or loosely coupled jobs.
– High upfront hardware cost and operational cost
– Low efficiency: GFLOPS per Watt, GFLOPS per Dollar

Parallel Computing in Matlab
% Serial version
for i=1:1024
    A(i) = sin(i*2*pi/1024);
end
plot(A);

% Parallel version: open a pool of 3 local workers
% (newer MATLAB releases use parpool instead of matlabpool)
matlabpool open local 3

parfor i=1:1024
    A(i) = sin(i*2*pi/1024);
end
plot(A);

matlabpool close

GPU Computing in Matlab

http://www.mathworks.cn/discovery/matlab-gpu.html
Cloud Computing

Everything is Cloud …

Five Attributes of Cloud Computing
• Service Based
– What the service needs to do is more important than how the technologies are used to
implement the solution.

• Scalable and Elastic


– The service can scale capacity up or down as the consumer demands at the speed of full
automation.

• Shared
– Services share a pool of resources to build economies of scale.

• Metered by Use
– Services are tracked with usage metrics to enable multiple payment models.

• Uses Internet Technologies


– The service is delivered using Internet identifiers, formats and protocols.
Flynn’s Taxonomy
• Single Instruction, Single Data (SISD)
– von Neumann System

• Single Instruction, Multiple Data (SIMD)


– Vector Processors, GPU

• Multiple Instruction, Single Data (MISD)


– Generally used for fault tolerance

• Multiple Instruction, Multiple Data (MIMD)


– Distributed Systems
– Single Program, Multiple Data (SPMD)
– Multiple Program, Multiple Data (MPMD)
Flynn’s Taxonomy

Von Neumann Architecture

Harvard Architecture

Inside a PC ...
Front-Side Bus (Core 2 Extreme)

8 B × 400 MHz × 4 transfers/cycle = 12.8 GB/s

Memory (DDR3-1600)

8 B × 200 MHz × 4 × 2 transfers/cycle = 12.8 GB/s

PCI Express 3.0 (×16)

≈1 GB/s per lane × 16 lanes ≈ 16 GB/s
Shared Memory System

[Figure: multiple CPUs connected through an interconnect to a single shared memory]
Non-Uniform Memory Access

[Figure: two multi-core processors, each with its own local memory, joined by interconnects; a core reaches its own memory with a fast local access and the other processor's memory with a slower remote access]
Distributed Memory System

[Figure: multiple CPUs, each with its own private memory, connected by a communication network]
Crossbar Switch
[Figure: a crossbar switch connecting processors P1–P4 to memory modules M1–M4, with a switch at every row–column intersection]
Cache
• Component that transparently stores data so that future requests for that data
can be served faster
– Compared to main memory: smaller, faster, more expensive
– Spatial Locality
– Temporal Locality

• Cache Line
– A block of data that is accessed together

• Cache Miss
– Failed attempts to read or write a piece of data in the cache
– Main memory access required
– Read Miss, Write Miss
– Compulsory Miss, Capacity Miss, Conflict Miss
Writing Policies

• Write-Through: every write to the cache is immediately propagated to main memory.

• Write-Back: a write updates only the cache; the modified (dirty) line is written back to main memory when it is evicted.
Cache Mapping
• Direct Mapped: each memory block maps to exactly one cache line (cache index = memory index mod number of cache lines).

• 2-Way Associative: each memory block maps to one set and may occupy either of the two lines in that set.
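A one-line sketch of how a direct-mapped cache picks a line (an illustrative helper, not from the slides):

/* Which cache line a byte address maps to in a direct-mapped cache. */
unsigned cache_line_index(unsigned addr, unsigned line_size, unsigned num_lines)
{
    return (addr / line_size) % num_lines;
}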
Cache Miss
A is stored in row-major order:

0,0  0,1  0,2  0,3
1,0  1,1  1,2  1,3
2,0  2,1  2,2  2,3
3,0  3,1  3,2  3,3

#define MAX 4
double A[MAX][MAX], x[MAX], y[MAX];

/* Initialize A and x, assign y=0 */

/* Row-major access order */
for (i=0; i<MAX; i++)
    for (j=0; j<MAX; j++)
        y[i] += A[i][j]*x[j];

/* Assign y=0 */

/* Column-major access order */
for (j=0; j<MAX; j++)
    for (i=0; i<MAX; i++)
        y[i] += A[i][j]*x[j];

How many cache misses does each loop ordering cause?
Cache Coherence
Two cores access shared memory through an interconnect; x (initially 2) is cached by both cores.

Time   Core 0                        Core 1
0      y0 = x;                       y1 = 3*x;
1      x = 7;                        Statements not involving x
2      Statements not involving x    z1 = 4*x;

What is the value of z1?

With a write-through policy ...

With a write-back policy ...
Cache Coherence

Left: Core 0 executes A=5 while Core 1 executes B=A*2, and both caches hold A. Core 0's write invalidates Core 1's copy, so Core 1 must reload A before it can read it: true sharing.

Right: Core 0 executes A=5 while Core 1 executes B=B+1, and A and B happen to sit on the same cache line in both caches. Core 0's write still invalidates Core 1's whole line, forcing a reload even though B itself never changed: this situation is called false sharing.
False Sharing
Serial version:

int i, j, m, n;
double y[m];

/* Assign y=0 */

for (i=0; i<m; i++)
    for (j=0; j<n; j++)
        y[i] += f(i, j);

m=8, two cores, cache line: 64 bytes

Parallel version:

/* Private variables */
int i, j, iter_count;

/* Shared variables */
int m, n, core_count;
double y[m];

iter_count = m/core_count;

/* Core 0 does this */
for (i=0; i<iter_count; i++)
    for (j=0; j<n; j++)
        y[i] += f(i, j);

/* Core 1 does this */
for (i=iter_count; i<2*iter_count; i++)
    for (j=0; j<n; j++)
        y[i] += f(i, j);

With m=8, all of y (8 × 8 bytes = 64 bytes) can fall into a single cache line, so the two cores keep invalidating each other's copy of that line even though they update different elements of y.
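A minimal sketch of one common fix (not from the slides): each core accumulates into a private scalar and writes y[i] only once, so the cores stop invalidating the cache line that holds y on every update.

/* Core 1's share, rewritten to avoid repeated writes to the shared array y */
for (i = iter_count; i < 2*iter_count; i++) {
    double tmp = 0.0;                 /* private to this core */
    for (j = 0; j < n; j++)
        tmp += f(i, j);
    y[i] = tmp;                       /* a single write per element */
}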
Virtual Memory
• Virtualization of various forms of computer data storage
into a unified address space
– Logically increases the capacity of main memory (e.g., DOS
can only access 1 MB of RAM).

• Page
– A block of contiguous virtual memory addresses
– The smallest unit to be swapped in/out of main memory
from/into secondary storage.

• Page Table
– Used to store the mapping between virtual addresses and
physical addresses.

• Page Fault
– The accessed page is not in the physical memory.
Interleaving Statements

[Figure: the possible interleavings of two threads T0 and T1, each executing statement s1 followed by s2]

The number of ways to interleave M statements from one thread with N statements from another is

C(M+N, M) = (M+N)! / (M! · N!)
Critical Region
• A portion of code where shared resources are accessed and updated

• Resources: data structure (variables), device (printer)

• Threads are disallowed from entering the critical region when another thread is
occupying the critical region.

• A means of mutual exclusion is required.

• If a thread is not executing within the critical region, that thread must not prevent
another thread seeking entry from entering the region.

• We consider two threads and one core in the following examples.


First Attempt

int threadNumber = 0;

void ThreadZero()
{
    while (TRUE) do {
        while (threadNumber == 1)
            do {}                     // spin-wait
        CriticalRegionZero;
        threadNumber = 1;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        while (threadNumber == 0)
            do {}                     // spin-wait
        CriticalRegionOne;
        threadNumber = 0;
        OtherStuffOne;
    }
}

• Q1: Can T1 enter the critical region more times than T0?
• Q2: What would happen if T0 terminates (by design or by accident)?
Second Attempt
int Thread0inside = 0;
int Thread1inside = 0;

void ThreadZero()
{
    while (TRUE) do {
        while (Thread1inside) do {}
        Thread0inside = 1;
        CriticalRegionZero;
        Thread0inside = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        while (Thread0inside) do {}
        Thread1inside = 1;
        CriticalRegionOne;
        Thread1inside = 0;
        OtherStuffOne;
    }
}

• Q1: Can T1 enter the critical region multiple times while T0 is not within the critical region?
• Q2: Can T0 and T1 be allowed to enter the critical region at the same time?
Third Attempt
int Thread0WantsToEnter = 0;
int Thread1WantsToEnter = 0;

void ThreadZero()
{
    while (TRUE) do {
        Thread0WantsToEnter = 1;
        while (Thread1WantsToEnter)
            do {}
        CriticalRegionZero;
        Thread0WantsToEnter = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        Thread1WantsToEnter = 1;
        while (Thread0WantsToEnter)
            do {}
        CriticalRegionOne;
        Thread1WantsToEnter = 0;
        OtherStuffOne;
    }
}
Fourth Attempt
int Thread0WantsToEnter = 0;
int Thread1WantsToEnter = 0;

void ThreadZero()
{
    while (TRUE) do {
        Thread0WantsToEnter = 1;
        while (Thread1WantsToEnter)
            do {
                Thread0WantsToEnter = 0;
                delay(someRandomCycles);
                Thread0WantsToEnter = 1;
            }
        CriticalRegionZero;
        Thread0WantsToEnter = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        Thread1WantsToEnter = 1;
        while (Thread0WantsToEnter)
            do {
                Thread1WantsToEnter = 0;
                delay(someRandomCycles);
                Thread1WantsToEnter = 1;
            }
        CriticalRegionOne;
        Thread1WantsToEnter = 0;
        OtherStuffOne;
    }
}
Dekker’s Algorithm
int Thread0WantsToEnter = 0, Thread1WantsToEnter = 0, favored = 0;

void ThreadZero()
{
    while (TRUE) do {
        Thread0WantsToEnter = 1;
        while (Thread1WantsToEnter)
            do {
                if (favored == 1) {
                    Thread0WantsToEnter = 0;
                    while (favored == 1) do {}
                    Thread0WantsToEnter = 1;
                }
            }
        CriticalRegionZero;
        favored = 1;
        Thread0WantsToEnter = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) do {
        Thread1WantsToEnter = 1;
        while (Thread0WantsToEnter)
            do {
                if (favored == 0) {
                    Thread1WantsToEnter = 0;
                    while (favored == 0) do {}
                    Thread1WantsToEnter = 1;
                }
            }
        CriticalRegionOne;
        favored = 0;
        Thread1WantsToEnter = 0;
        OtherStuffOne;
    }
}
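On modern hardware the same mutual exclusion is usually built from an atomic read-modify-write instruction rather than from Dekker's flags; a minimal C11 sketch of a spin lock:

#include <stdatomic.h>

atomic_flag region_lock = ATOMIC_FLAG_INIT;

void enter_region(void)
{
    while (atomic_flag_test_and_set(&region_lock))
        ;                             /* spin until the flag was previously clear */
}

void leave_region(void)
{
    atomic_flag_clear(&region_lock);
}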
Parallel Program Design
• Foster’s Methodology

• Partitioning
– Divide the computation to be performed and the data operated on by the computation into small
tasks.

• Communication
– Determine what communication needs to be carried out among the tasks.

• Agglomeration
– Combine tasks that communicate intensively with each other or must be executed sequentially
into larger tasks.

• Mapping
– Assign the composite tasks to processes/threads to minimize inter-processor communication and
maximize processor utilization.

Parallel Histogram

Each thread calls Find_bin() on its share of the data (…, data[i-1], data[i], data[i+1], …) and increments the shared counters: bin_counts[b-1]++, bin_counts[b]++, … for bins 0–5.

When several threads increment the same bin_counts entry, the updates race with each other.
Parallel Histogram
Each thread instead maps its elements (…, data[i-1], data[i], data[i+1], data[i+2], …) to bins and increments its own private counters: loc_bin_cts[b-1]++, loc_bin_cts[b]++.

When all threads have finished, the local counts are merged into the shared histogram: bin_counts[b-1] += …, bin_counts[b] += …
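A hedged OpenMP sketch of this pattern (find_bin() and the bin count are assumptions for illustration): each thread fills a private local histogram, and the local counts are merged into the shared one at the end.

#include <string.h>
#include <omp.h>

#define NBINS 6

int find_bin(double value);                    /* assumed to return a bin index in [0, NBINS) */

void histogram(const double *data, int n, int bin_counts[NBINS])
{
    memset(bin_counts, 0, NBINS * sizeof(int));

    #pragma omp parallel
    {
        int loc_bin_cts[NBINS] = {0};          /* private to each thread */

        #pragma omp for
        for (int i = 0; i < n; i++)
            loc_bin_cts[find_bin(data[i])]++;  /* no race: counters are local */

        /* Merge the local counts into the shared histogram, one thread at a time. */
        #pragma omp critical
        for (int b = 0; b < NBINS; b++)
            bin_counts[b] += loc_bin_cts[b];
    }
}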
Performance
• Speedup

S = T_serial / T_parallel

• Efficiency

E = S / N = T_serial / (N · T_parallel)

• Scalability
– Problem Size, Number of Processors

• Strongly Scalable
– Same efficiency for larger N with fixed problem size

• Weakly Scalable
– Same efficiency for larger N with a fixed problem size per processor
Amdahl's Law

S(N) = 1 / ((1 − P) + P/N)

where P is the parallelizable fraction of the program and N is the number of processors.
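As a quick worked example of the law's implication (numbers chosen only for illustration): if P = 0.9, the speedup with N = 8 processors is 1 / (0.1 + 0.9/8) ≈ 4.7, and even with N → ∞ it can never exceed 1 / (1 − 0.9) = 10.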
Gustafson's Law

TParallel  a  b TSerial  a  N  b

sequential parallel

a  N b a
S(N )   N    ( N  1) for  
ab ab
• Linear speedup can be achieved when:
– Problem size is allowed to grow monotonically with N.
– The sequential part is fixed or grows slowly.

• Is it possible to achieve super linear speedup?


Review
• Why is parallel computing important?

• What is data dependency?

• What are the benefits and issues of fine-grained parallelism?

• What are the three types of parallelism?

• What is the difference between concurrent and parallel computing?

• What are the essential features of cloud computing?

• What is Flynn’s Taxonomy?

Review
• Name the four categories of memory systems.

• What are the two common cache writing policies?

• Name the two types of cache mapping strategies.

• What is a cache miss and how to avoid it?

• What may cause the false sharing issue?

• What is a critical region?

• How to verify the correctness of a concurrent program?

Review
• Name three major APIs for parallel computing.

• What are the benefits of GPU computing compared to MapReduce?

• What is the basic procedure of parallel program design?

• What are the key performance factors in parallel programming?

• What is a strongly/weakly scalable parallel program?

• What is the implication of Amdahl's Law?

• What does Gustafson's Law tell us?

