ARCHITECTURE
REVIEW: SHARED MEMORY VS.
MESSAGE PASSING
Loosely coupled multiprocessors
No shared global memory address space
Multicomputer network
Network-based multiprocessors
Usually programmed via message passing
Explicit calls (send, receive) for communication
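The explicit send/receive model above can be sketched in a few lines (a hypothetical example, not from the slides): two processes with no shared address space communicate only through messages over a channel.

```python
# Minimal message-passing sketch: the worker never touches the parent's
# memory; all communication is via explicit send/recv calls.
from multiprocessing import Pipe, Process

def worker(conn):
    data = conn.recv()       # explicit receive: data arrives as a copy
    conn.send(sum(data))     # explicit send: result goes back as a message
    conn.close()

def round_trip(values):
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send(values)  # no shared variable is ever written
    result = parent_end.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(round_trip([1, 2, 3, 4]))  # 10
```

Note that the data is copied into the message: the two processes could just as well be on different nodes of a multicomputer network.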
INTERCONNECTION SCHEMES FOR
SHARED MEMORY
Scalability dependent on interconnect
UMA/UCA: UNIFORM MEMORY OR
CACHE ACCESS
• All processors have the same un-contended latency to memory
• Latencies get worse as system grows
• Symmetric multiprocessing (SMP) ~ UMA with bus interconnect
[Figure: processors connected through an interconnection network (long latency, contention in network) to main memory banks (contention in memory banks)]
UNIFORM MEMORY/CACHE ACCESS
+ Data placement unimportant/less important (easier to optimize code and make use
of available memory space)
- Scaling the system increases latencies
- Contention could restrict bandwidth and increase latency
EXAMPLE SMP
Quad-pack Intel Pentium Pro
HOW TO SCALE SHARED MEMORY
MACHINES?
Two general approaches
Maintain UMA
Provide a scalable interconnect to memory
Downside: Every memory access incurs the round-trip network latency
[Figure: processors, each with a short-latency path to its memory, connected through an interconnection network (long latency, contention in network)]
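The downside above can be made concrete with a back-of-envelope latency model (all numbers are illustrative assumptions, not from the slides): if every access must cross the network, the round trip dominates; if only a fraction of accesses is remote, average latency drops sharply.

```python
# Toy average-latency model: local_ns is the latency of a nearby memory,
# network_ns the added network round trip, remote_fraction the share of
# accesses that must traverse the network.
def avg_latency_ns(local_ns, network_ns, remote_fraction):
    return local_ns + network_ns * remote_fraction

if __name__ == "__main__":
    # Maintain-UMA design: every access pays the round trip.
    print(avg_latency_ns(50, 200, 1.0))  # 250.0
    # Design with nearby memory and only 10% remote accesses.
    print(avg_latency_ns(50, 200, 0.1))  # 70.0
```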
CONVERGENCE OF PARALLEL
ARCHITECTURES
Scalable shared memory architecture is similar to scalable
message passing architecture
Main difference: is remote memory accessible with loads/stores?
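That main difference can be sketched as follows (a hypothetical example): in the shared-memory model, threads reach all data with ordinary loads and stores, and the hardware, not the program, handles any remote access.

```python
# Shared-memory counterpart of explicit message passing: every thread
# sees one address space and communicates through plain stores.
import threading

shared = [0] * 4  # one address space visible to all threads

def square(i):
    shared[i] = i * i  # an ordinary store; no send/receive needed

def run():
    threads = [threading.Thread(target=square, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared

if __name__ == "__main__":
    print(run())  # [0, 1, 4, 9]
```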
HISTORICAL EVOLUTION: 1960S & 70S
• Early MPs
– Mainframes
– Small number of processors
– Crossbar interconnect
– UMA
[Figure: processors connected to memory banks through a crossbar]
HISTORICAL EVOLUTION: 1980S
• Bus-Based MPs
– enabler: processor-on-a-board
– economical scaling
– precursor of today’s SMPs
– UMA
HISTORICAL EVOLUTION: CURRENT
Chip multiprocessors (multi-core)
Small to Mid-Scale multi-socket CMPs
One module type: processor + caches + memory
Clusters/Datacenters
Use high performance LAN to connect SMP blades, racks
Driven by applications
Many more throughput applications (web servers)
… than parallel applications (weather prediction)
Cloud computing
HISTORICAL EVOLUTION: FUTURE
Cluster/datacenter on a chip?
Heterogeneous multi-core?
???
MULTI-CORE PROCESSORS
MOORE’S LAW
What else could you do with the die area you dedicate to multiple
processors?
Have a bigger, more powerful core
Have larger caches in the memory hierarchy
Simultaneous multithreading
Integrate platform components on chip (e.g., network interface, memory
controllers)
WHY MULTI-CORE?
Alternative: Bigger, more powerful single core
Larger superscalar issue width, larger instruction window, more
execution units, large trace caches, large branch predictors, etc
MULTI-CORE VS. LARGE SUPERSCALAR
Multi-core advantages
+ Simpler cores → more power efficient, lower complexity, easier to
design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multi-programmed workloads → reduced
context switches
+ Higher system throughput in parallel applications
Multi-core disadvantages
- Requires parallel tasks/threads to improve performance
(parallel programming)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- Number of pins limits data supply for increased demand
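The first disadvantage can be quantified with Amdahl's law (a standard result, not stated on the slide): if only a fraction p of the work is parallelizable, n cores give a speedup of 1 / ((1 - p) + p / n).

```python
# Amdahl's law: the serial fraction (1 - p) caps the achievable speedup
# no matter how many cores are added.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

if __name__ == "__main__":
    # A program that is only 50% parallel gains less than 2x even on 16 cores.
    print(round(amdahl_speedup(0.5, 16), 2))  # 1.88
```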
LARGE SUPERSCALAR VS. MULTI-CORE
Olukotun et al., “The Case for a Single-Chip Multiprocessor,”
ASPLOS 1996.
Technology push
Instruction issue queue size limits the cycle time of the superscalar,
OoO processor → diminishing performance
Quadratic increase in complexity with issue width
Large, multi-ported register files to support large instruction windows
and issue widths → reduced frequency or longer RF access,
diminishing performance
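The quadratic growth can be illustrated with a rough count (an illustrative model, not a formula from the paper): each instruction issued in a cycle must be cross-checked for dependences against every other instruction issued that cycle.

```python
# ~W * (W - 1) pairwise dependence comparisons per cycle for issue width W,
# i.e. roughly quadratic growth in issue-logic complexity.
def dependence_checks(width):
    return width * (width - 1)

if __name__ == "__main__":
    # Doubling issue width from 4 to 8 roughly quadruples the checks.
    print(dependence_checks(4), dependence_checks(8))  # 12 56
```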
Application pull
Integer applications: little parallelism?
FP applications: abundant loop-level parallelism
Others (transaction proc., multiprogramming): CMP better fit
WHY MULTI-CORE?
Alternative: Bigger caches
WHY MULTI-CORE?
Alternative: (Simultaneous) Multithreading
- Scalability is limited: need bigger register files, larger issue width (and
associated costs) to support many threads → complex with many threads
- Parallel performance limited by shared fetch bandwidth
- Extensive resource sharing in the pipeline and memory system reduces
both single-thread and parallel application performance