
Multiprocessor Architectures

David Gregg
Department of Computer Science
University of Dublin, Trinity College

Multiprocessors
• Machines with multiple processors are
what we often think of when we talk of
parallel computing
• Multiple processors cooperate and
communicate to solve problems fast
• A multiprocessor architecture consists of
– Individual processor computer architecture
– Communication architecture
All kinds of everything
• Two kinds of physical memory organization:
• Physically centralized memory
– Allows only a few dozen processor chips
• Physically distributed memory
– Larger number of chips and cores
– Simpler hardware
– More memory bandwidth
– More variable memory latency
• Latency depends heavily on distance to memory
Centralized vs. distributed memory

[Figure: two memory organizations. Centralized memory: processors P1…Pn, each with a cache ($), share one interconnection network to a single memory. Distributed memory: each processor P1…Pn has a cache and its own local memory, joined by an interconnection network; this organization scales further.]

More kinds of everything
• Two logical views of memory
– Logically shared memory
– Logically distributed memory
• Logical view does not have to follow physical
implementation
– Logically shared memory can be implemented on
top of a physically distributed memory
• Often with hardware support
– Logically distributed memory can be built on top of
a physically centralized memory

Logical view of memory
• Logically shared memory programming
model
– communication between processors uses shared variables and locks
– e.g. OpenMP, pthreads, Java threads (see the pthreads sketch below)
• Logically distributed memory
– communication between processors uses
message passing
– e.g. MPI, Occam
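
A minimal sketch of the shared-memory model (illustrative, not from the slides): two pthreads communicate through a shared counter, with a lock protecting each update. Compile with -pthread.

/* Shared-memory model: threads communicate through a shared
   variable; a mutex makes each update atomic. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                  /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);        /* enter critical section */
        counter++;
        pthread_mutex_unlock(&lock);      /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* prints 2000000 */
    return 0;
}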

UMA versus NUMA
• Shared memory may be physically
centralized or physically distributed
• Physically centralized
– All processors have equal access to memory
– Symmetric multiprocessor model
– Uniform memory access (UMA) costs
– Scales to a few dozen processors
– Cost increases rapidly as number of
processors increases

UMA versus NUMA (contd.)
• Physically distributed shared memory
– Each processor has its own local memory
– Cost of accessing local memory is low
– But address space is shared, so each
processor can access any memory in system
– Accessing memories of other processors
has higher cost than local memory
– Some memories may be very distant
– Non-uniform memory access (NUMA) costs

Summary
• Four main categories of multiprocessor
– Physically centralized, logically shared memory
• UMA, symmetric multiprocessor
– Physically distributed, logically shared memory
• NUMA
– Physically distributed, logically distributed memory
• Message-passing parallel machines
• Most supercomputers today
– Physically centralized, logically distributed memory
• No specific machines are designed this way
• Significant cost of building a centralized memory
• But programming with a message-passing model is not unknown on SMP machines (see the MPI sketch below)
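
For contrast with the shared-memory sketch above, a minimal message-passing example in MPI (illustrative): the processes share no variables and communicate only by explicit messages. Compile with mpicc and run with mpirun -np 2.

/* Message-passing model: rank 0 sends a value to rank 1;
   there is no shared address space between the processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}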
Symmetric Multiprocessors (SMP)
• Common type of shared memory machine
– Physically centralized memory
– Logically shared address space
• Local caches are used to reduce bus
traffic to the central memory
• Programming is relatively simple (see the OpenMP sketch below)
– Parallel threads sharing memory
– Memory access costs are uniform
– Need to avoid cache misses, as in sequential programming
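
A minimal sketch of this programming style using OpenMP (illustrative): the threads share one address space and the runtime divides the loop iterations among the cores. Compile with -fopenmp.

/* SMP programming model: parallel threads share the arrays;
   OpenMP splits the iterations across the cores. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(void) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i];   /* every thread sees the same arrays */
        sum += a[i];
    }
    printf("sum = %f\n", sum);
    return 0;
}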
SMP Multi-core
• Symmetric multi-processing is the most
common model for multi-core processors
• Multi-core is a relatively new term
– An earlier name was chip multiprocessor (CMP)
• Multiple cores on a single chip, sharing a
single external memory
– with caches to reduce memory traffic

E.g. Intel i7 Processor
• A family of processors; we consider one example
• Four cores per chip
– Each core has its own L1 and L2 cache
• 4 × 256 KB (L2)
– A single shared L3 cache
• 8 MB
• External bus from the L3 cache to external memory
Other SMP Multi-cores
• Another common pattern is to reuse
configurations from chips with fewer
cores
• E.g. dual-core processors
– It is popular to take a dual-core configuration and replicate it on one chip
– E.g. four cores
– Each core has its own L1 cache
– Each pair shares an L2 cache
– The L2 caches connect to memory
SMP Multi-cores
• SMP multi-cores scale pretty well but
they may not scale forever
• Immediate issue: cache coherency
– Cache coherency is much simpler on a single
chip than between chips
– But it is still complex and doesn’t scale well
– In the medium to long term we may see
more multi-core processors that do not
share all their memory
• E.g. Cell B/E or Movidius Myriad; but how do we program these?
SMP Multi-cores
• Immediate issue: memory bandwidth
– As the number of cores rises, the amount of
memory bandwidth needed will also rise
• Moore's law means that the number of cores can double every 18-24 months (at least in the medium term)
• But number of pins is limited
– Experimental architectures put pins through chip
to stacked memory
– Latency can be traded for bandwidth
– Bigger on-chip caches
• Caches may be much larger and much slower in future
• Possible move to placing processor in memory?
Multi-chip SMP
• Traditional type of SMP machine
• More difficult to build than multi-core
– Running wires within a chip is cheap
– Running wires between chips is expensive
• Caches are essential to lower bus traffic
• Must provide hardware to ensure that caches
and memory are consistent (cache coherency)
– Expensive across multiple chips
• Must provide a hardware mechanism to support process synchronization (see the spinlock sketch below)
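
A minimal sketch of what that hardware support looks like to software (illustrative): C11 atomics compile down to the atomic read-modify-write instructions the hardware provides, which the cache-coherency machinery keeps consistent across processors.

/* Simple test-and-set spinlock built on a C11 atomic flag.
   atomic_flag_test_and_set maps to an atomic read-modify-write
   instruction (e.g. LOCK XCHG on x86). */
#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void spin_lock(void) {
    while (atomic_flag_test_and_set(&lock_flag))
        ;  /* spin until the flag was previously clear */
}

void spin_unlock(void) {
    atomic_flag_clear(&lock_flag);
}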
Multi-chip SMP
• Why multi-chip?
– Why not just build a bigger chip?
– More cores on one chip with no expensive inter-
chip communication
• Yield in chip manufacture
– The VLSI process can create very large chips
– The process creates a big
“wafer” of silicon
– The wafer is then cut into
individual chips
– But we could make fewer big chips from the same wafer
Multi-chip SMP
• But silicon wafers are physical things
– There are physical imperfections in the wafer that cause that
part of the wafer to not function correctly
– An imperfection may cause the entire chip containing the
imperfection to fail
– We could make one chip from
the entire wafer, but this would
probably contain an imperfection
somewhere, so we would get zero
working chips
– By cutting the wafer into smaller chips, each imperfection in the silicon causes only one small chip to be defective (a rough yield calculation follows)
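
As a rough illustration (a standard first-order yield model, not from the slides): if defects land randomly with density D per unit area and a chip has area A, the fraction of working chips is about Y = exp(-D * A). If D * A = 0.1 for a small chip, Y is about 90%; a single chip with 40 times that area has D * A = 4, so Y is about 2%, and a wafer-sized chip would almost certainly be defective.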
Multi-chip SMP

[Figure: several processors, each with its own cache, connected by a single bus to memory and I/O.]

Multi-chip SMP
• Whether the SMP is single or multiple
chip, the programming model is the same
• But performance trade-offs may be
different
– Communication may be much more expensive
in multi-chip SMP

Multi-chip SMP
• Sharing a single memory among multiple processors tends to cause bottlenecks
– Memory access time (i.e. latency) becomes
slow for all processors
– Shared memory bandwidth becomes a
bottleneck
• Computers are physical things, and a
single physical shared memory is a
problem
Multi-chip SMP with Bus

[Figure: processors with caches sharing a single bus to memory; the bus is a bottleneck.]

Multi-chip SMP memory controller

[Figure: processors with caches connected to memory through a memory controller; arbitration at the controller is a bottleneck.]

Multi-chip SMP
• An alternative to having a single physical
memory shared between all processors is
a distributed memory
– Many arrangements are possible
• Multiple processors and multiple
memories
– All-to-all connections
– Crossbar
– Multiple buses
• Similar to distributed memory machines
Multi-chip SMP with All-to-All Connections

[Figure: one example; every processor, each with its own cache, is connected to every memory.]

Multi-chip NUMA
• Distributed memory in a shared-memory
machine tends to lead to NUMA
• Symmetric multiprocessor (SMP)
– Means memory access time to a given piece
of memory is the same from all processors
• Non-uniform memory access time
– Means memory access time to a given piece
of memory can differ between processors

One possible NUMA machine
[Figure: each processor, with its cache, is paired with a fast local memory; the memories are joined by a single bus. Access to local memory is fast; access to non-local memory over the bus is slower.]


Multi-chip NUMA
• Distributed shared memory tends to lead to
NUMA because
– Computers are physical things
– Easy to pair processors and memory
• Greatly reduces local memory access time
– We mass-produce single-processor machines
• So we can use off-the-shelf components for local memory access
– We can use the local memory as a cache for non-local memory
• Greatly reduces access time and non-local memory bandwidth (see the allocation sketch below)
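
A minimal sketch of NUMA-aware allocation using libnuma on Linux (illustrative; link with -lnuma): placing pages on a chosen node keeps accesses local, and therefore fast, for threads running on that node.

/* Allocate memory on a specific NUMA node so the cores of
   that node get fast local accesses. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        printf("no NUMA support on this machine\n");
        return 1;
    }
    size_t size = 64UL * 1024 * 1024;
    /* 64 MB placed on node 0: local (fast) for cores on node 0,
       remote (slower) for cores on other nodes */
    void *buf = numa_alloc_onnode(size, 0);
    /* ... pin a thread to node 0 and work on buf ... */
    numa_free(buf, size);
    return 0;
}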
Shared memory and 3D stacking
• Most current multicore processors have a
single shared SMP memory
– The distance between different cores on the
processor is much smaller than the distance
between any core and memory
– It mostly does not make sense to have non-
uniform memory access times within a single
chip
– But this may all change with 3D stacking

Shared memory and 3D stacking
• 3D stacking
– Place the memory chip directly on top of the
processor
– A “through-silicon via” (TSV) is a tiny connector on the face of each chip
– Hundreds of TSVs connecting processor and
memory
– Processor and memory communicate along
TSVs

Shared memory and 3D stacking

[Figure: side view; a memory chip stacked directly on top of the processor.]

• With 3D stacking, processor and memory are extremely close
• Access time is very low
• The distance between a core and nearby memory may be smaller than between distant cores
Shared memory and 3D stacking

[Figure: side view; a memory chip stacked on top of the processor, as above.]

• It seems likely that multicore processors with 3D-stacked memory will eventually switch from SMP to NUMA
• The advantages of fast access to the nearest memory are probably large
2.5D stacking

[Figure: side view; memory and processor chips stacked in a single package.]

• True 3D stacking remains rare, but it is increasingly common to put multiple chips in a single package
• Chips are stacked in a package but connected at the edges, not using through-silicon vias
