Chapter 7:
Multicores, Multiprocessors, and Clusters
7.1. Introduction
❑ Multicore microprocessors
l Chips with multiple processors (cores)
Types of Parallelism
[Figure: execution timelines for the four combinations of sequential/concurrent software on serial/parallel hardware]
What We’ve Already Covered
❑ §2.11: Parallelism and Instructions
l Synchronization
7.2. The Difficulty of Creating Parallel Processing Programs
❑ Difficulties
l Partitioning
l Coordination
l Communications overhead
Amdahl’s Law
❑ Sequential part can limit speedup
❑ Example: 100 processors, 90× speedup?
l Told = Tparallelizable + Tsequential
l Tnew = Tparallelizable/100 + Tsequential
l Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
l Solving: Fparallelizable = 0.999
- i.e., the sequential part can be only 0.1% of the original time
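A quick check of this arithmetic, sketched in Python; symbols follow the slide (F is the parallelizable fraction, with 100 processors):

```python
# Amdahl's Law: speedup(F, p) = 1 / ((1 - F) + F / p)

def speedup(f, p):
    """Speedup for parallelizable fraction f on p processors."""
    return 1.0 / ((1.0 - f) + f / p)

# To reach 90x on 100 processors, solve 1/((1-F) + F/100) = 90 for F:
# (1-F) + F/100 = 1/90  ->  1 - 0.99*F = 1/90  ->  F = (1 - 1/90) / 0.99
f_needed = (1 - 1 / 90) / 0.99
print(round(f_needed, 3))                 # 0.999
print(round(speedup(f_needed, 100), 1))   # 90.0
```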
Scaling Example
❑ Workload: sum of 10 scalars, and 10 × 10 matrix sum
l Speed up from 10 to 100 processors
❑ Single processor: Time = (10 + 100) × tadd
❑ 10 processors
l Time = 10 × tadd + 100/10 × tadd = 20 × tadd
l Speedup = 110/20 = 5.5 (5.5/10 = 55% of potential)
❑ 100 processors
l Time = 10 × tadd + 100/100 × tadd = 11 × tadd
l Speedup = 110/11 = 10 (10/100 = 10% of potential)
❑ Assumes load can be balanced across processors
Scaling Example (cont)
Strong vs Weak Scaling
❑ Strong scaling: problem size fixed (as in the previous example)
❑ Weak scaling: problem size proportional to number of processors
l 10 processors, 10 × 10 matrix
- Time = 20 × tadd
l 100 processors, 32 × 32 matrix
- Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
l Constant performance in this example
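The strong- and weak-scaling arithmetic above can be checked with a small script; the timing model (in units of tadd: 10 sequential scalar additions plus a matrix sum spread across p processors) follows the slides:

```python
def time_units(p, matrix_elems):
    """Time on p processors, assuming the matrix sum balances perfectly."""
    return 10 + matrix_elems / p

# Strong scaling: fixed 10x10 matrix (100 additions)
t1 = time_units(1, 100)             # 110 on a single processor
print(t1 / time_units(10, 100))     # 5.5  (55% of the potential 10x)
print(t1 / time_units(100, 100))    # 10.0 (10% of the potential 100x)

# Weak scaling: grow the matrix with p (32 x 32 ~ 1000 elements)
print(time_units(10, 100))          # 20.0
print(time_units(100, 1000))        # 20.0: constant time as p grows
```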
7.3. Shared Memory Multiprocessors
Shared Memory Arch: UMA
[Figure: UMA; all processors share a single bus to memory (M), so bus contention limits scaling]
Shared Memory Arch: NUMA
[Figure: processors (P) with caches (C) and local memories (M), all connected by a shared bus]
Example: Sun Fire V210 / V240 Mainboard
Example: Dell PowerEdge R720
Example: Sum Reduction
❑ Each of the 100 processors (Pn = 0 … 99) sums its own 1,000 elements:
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i++)
    sum[Pn] = sum[Pn] + A[i];
[Figure: reduction trees over processors 0–3, and over processors 0–6 for the case where half is odd]
half = 100; /* 100 processors */
repeat
    synch();
    if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
    /* Conditional sum needed when half is odd;
       Processor0 gets missing element */
    half = half/2; /* dividing line on who sums */
    if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
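A sequential Python simulation of this shared-memory reduction; the synch() barriers are implicit because each step runs to completion before the next, and the contents of A are an assumed example:

```python
# 100 "processors" each hold a partial sum; at every step the lower half
# adds in the upper half, with processor 0 absorbing the odd element.
P = 100
A = list(range(100_000))                        # assumed sample data
sums = [sum(A[1000 * pn:1000 * (pn + 1)]) for pn in range(P)]

half = P
while half > 1:
    if half % 2 != 0:                           # Pn == 0 picks up the odd element
        sums[0] += sums[half - 1]
    half //= 2                                  # dividing line on who sums
    for pn in range(half):
        sums[pn] += sums[pn + half]

print(sums[0] == sum(A))                        # True
```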
7.4. Clusters and Other Message-Passing Multiprocessors
Loosely Coupled Clusters
❑ Each processor receives 1,000 numbers and computes a partial sum:
sum = 0;
for (i = 0; i < 1000; i++)
    sum = sum + AN[i];
❑ Reduction
l Half the processors send, the other half receive and add
l Then a quarter send, a quarter receive and add, …
Sum Reduction (Again)
[Figure: send/receive reduction tree over processors 0–6 with half = 7]
l Send/receive also provide synchronization
l Assumes send/receive take similar time to addition
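The send/receive reduction described above can be simulated in Python; the deque mailboxes stand in for the network, the send/receive pairing doubles as synchronization, and all data and names here are illustrative:

```python
from collections import deque

# Each "node" has a private partial sum and a mailbox; senders and
# receivers pair up level by level until one sum remains on node 0.
P = 100
A = list(range(100_000))                      # assumed sample data
local = [sum(A[1000 * pn:1000 * (pn + 1)]) for pn in range(P)]
mailbox = [deque() for _ in range(P)]         # one message queue per node

limit = P
half = P
while half > 1:
    half = (half + 1) // 2                    # send vs. receive dividing line
    for pn in range(half, limit):             # upper group sends...
        mailbox[pn - half].append(local[pn])
    for pn in range(limit // 2):              # ...lower group receives and adds
        local[pn] += mailbox[pn].popleft()
    limit = half                              # upper limit of senders

print(local[0] == sum(A))                     # True
```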
Message Passing Systems
❑ Message passing is a type of communication between processes or objects in computer science
❑ Message passing systems have been called "shared nothing" systems (each participant is a black box)
❑ Examples: ONC RPC, CORBA, Java RMI, DCOM, SOAP, .NET Remoting, CTOS, QNX Neutrino RTOS, OpenBinder, D-Bus, Unison RTOS
❑ Open source: Beowulf, Microwulf
Coarse-grain multithreading
❑ Switch threads only on a long stall (e.g., an L2 cache miss)
[Figure: issue slots over time; the same thread keeps issuing until a long stall forces a switch]
Fine-grain multithreading
❑ Switch threads every cycle, interleaving instructions from different threads
Simultaneous Multithreading
❑ In multiple-issue dynamically scheduled processor
l Schedule instructions from multiple threads
l Instructions from independent threads execute when function
units are available
l Within threads, dependencies handled by scheduling and
register renaming
Simultaneous Multi-Threading
[Figure: SMT pipeline with two program counters (PCA, PCB) sharing the fetch, memory, and write-back logic]
Multithreading Example
Future of Multithreading
❑ Will it survive? In what form?
❑ Power considerations ⇒ simplified microarchitectures
l Simpler forms of multithreading
7.6. Instruction and Data Streams
❑ An alternate classification
                          Data Streams
                          Single                    Multiple
Instruction    Single     SISD: Intel Pentium 4     SIMD: SSE instructions of x86
Streams        Multiple   MISD: No examples today   MIMD: Intel Xeon e5345
Single Instruction, Single Data
Multiple Instruction, Multiple Data
l IBM POWER5
l HP/Compaq AlphaServer
l Intel IA32
Single Instruction, Multiple Data
❑ Reduced instruction control hardware
❑ Works best for highly data-parallel applications with a high degree of regularity, such as graphics/image processing
Single Instruction, Multiple Data
l ILLIAC IV
l MasPar
l Thinking Machines CM-2
l GPU
Vector Processors
Vector vs. Scalar
❑ Vector architectures and compilers
l Simplify data-parallel programming
l Explicit statement of absence of loop-carried dependences
- Reduced checking in hardware
l Regular access patterns benefit from interleaved and burst memory
l Avoid control hazards by avoiding loops
7.7. History of GPUs
❑ 3D graphics processing
l Originally done on high-end computers (e.g., SGI)
l Moore's Law ⇒ lower cost, higher density
l 3D graphics cards for PCs and game consoles
Graphics in the System
GPU Architectures
❑ Processing is highly data-parallel
l GPUs are highly multithreaded
l Use thread switching to hide memory latency
- Less reliance on multi-level caches
l Graphics memory is wide and high-bandwidth
❑ Trend toward general purpose GPUs
l Heterogeneous CPU/GPU systems
l CPU for sequential code, GPU for parallel code
❑ Programming languages/APIs
l DirectX, OpenGL
l C for Graphics (Cg), High Level Shader Language (HLSL)
l Compute Unified Device Architecture (CUDA)
(Heterogeneous: not uniform; a mix of different kinds of processors)
Example: NVIDIA Tesla
[Figure: Tesla GPU; each Streaming Multiprocessor contains 8 Streaming Processors]
Example: NVIDIA Tesla
❑ Streaming Processors
l Single-precision FP and integer units
l Each SP is fine-grained multithreaded
7.8. Interconnection Networks
❑ Network topologies
l Arrangements of processors, switches, and links
l Bus
l Ring
l 2D Mesh
l N-cube (N = 3)
l Fully connected
Network Characteristics
❑ Performance
l Latency per message (unloaded network)
l Throughput
- Link bandwidth
- Total network bandwidth
- Bisection bandwidth
l Congestion delays (depending on traffic)
❑ Cost
❑ Power
❑ Routability in silicon
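For two of the topologies listed above, the bandwidth measures can be computed directly: total bandwidth is link bandwidth times the number of links, and bisection bandwidth counts the links cut when the network is split in half. A sketch, assuming unit link bandwidth and an even processor count P:

```python
def ring(P):
    # P links in the ring; cutting it in half severs 2 links
    return {"total": P, "bisection": 2}

def fully_connected(P):
    # One link per node pair; every cross-half pair is cut by a bisection
    return {"total": P * (P - 1) // 2,
            "bisection": (P // 2) ** 2}

print(ring(64))             # {'total': 64, 'bisection': 2}
print(fully_connected(64))  # {'total': 2016, 'bisection': 1024}
```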
7.9. Parallel Benchmarks
Roofline Diagram
Attainable GFLOPs/sec
= Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
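The roofline model (attainable throughput is capped by whichever ceiling is lower: memory bandwidth times arithmetic intensity, or peak floating-point performance) can be expressed as a one-line function; the sample bandwidth and peak numbers below are assumptions, not measurements:

```python
def attainable_gflops(peak_bw_gbs, arith_intensity, peak_fp_gflops):
    """Roofline: min(memory-bound ceiling, compute-bound ceiling)."""
    return min(peak_bw_gbs * arith_intensity, peak_fp_gflops)

print(attainable_gflops(16.0, 0.25, 75.0))  # 4.0  (memory-bound, low intensity)
print(attainable_gflops(16.0, 8.0, 75.0))   # 75.0 (compute-bound)
```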
Comparing Systems
Optimizing Performance
❑ Optimize FP performance
l Balance adds & multiplies
l Improve superscalar ILP and use of SIMD instructions
Optimizing Performance
❑ Arithmetic intensity is not always fixed
l May scale with problem size
l Caching reduces memory accesses
- Increases arithmetic intensity
7.11. Four Example Systems
❑ 2 × quad-core Intel Xeon e5345 (Clovertown)
❑ 2 × quad-core AMD Opteron X4 2356 (Barcelona)
Four Example Systems
❑ 2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)
❑ 2 × oct-core IBM Cell QS20
And Their Rooflines
❑ Kernels
l SpMV (left)
l LBMHD (right)
❑ Some optimizations change arithmetic intensity
❑ x86 systems have higher peak GFLOPs
l But harder to achieve, given memory bandwidth
Performance on SpMV
❑ Arithmetic intensity
l 0.166 before memory optimization, 0.25 after
❑ Xeon vs. Opteron
l Similar peak FLOPS
l Xeon limited by shared FSBs and chipset
❑ UltraSPARC/Cell vs. x86
l 20 – 30 vs. 75 peak GFLOPs
l More cores and memory bandwidth
Performance on LBMHD
❑ Arithmetic intensity
l 0.70 before optimization, 1.07 after
❑ Opteron vs. UltraSPARC
l More powerful cores, not limited by memory bandwidth
❑ Xeon vs. others
l Still suffers from memory bottlenecks
Achieving Performance
7.12. Fallacies
❑ Amdahl’s Law doesn’t apply to parallel computers
l Since we can achieve linear speedup
l But only on applications with weak scaling
Pitfalls
❑ Not developing the software to take advantage of a multiprocessor architecture
l Example: using a single lock for a shared composite resource
- Serializes accesses, even if they could be done in parallel
- Use finer-granularity locking
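A minimal sketch of this pitfall and its fix, using Python's threading module; the hash-table layout and the names (NUM_BUCKETS, insert_fine) are illustrative assumptions, not from the text:

```python
import threading

NUM_BUCKETS = 16
table = [dict() for _ in range(NUM_BUCKETS)]

# Pitfall: one coarse lock for the whole table serializes every insert,
# even inserts into different buckets that could proceed in parallel.
coarse_lock = threading.Lock()

# Fix: finer granularity, one lock per bucket; only writers to the same
# bucket ever wait on each other.
bucket_locks = [threading.Lock() for _ in range(NUM_BUCKETS)]

def insert_fine(key, value):
    b = hash(key) % NUM_BUCKETS
    with bucket_locks[b]:
        table[b][key] = value

threads = [threading.Thread(target=insert_fine, args=(f"k{i}", i))
           for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(len(b) for b in table))   # 100
```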
7.13. Concluding Remarks