
Multiprocessors: Basics, Cache coherence, Synchronization, and Memory consistency

Z. Jerry Shi
Assistant Professor of Computer Science and Engineering
University of Connecticut

* Slides adapted from Blumrich&Gschwind/ELE475’03, Peh/ELE475’*

Parallel Computers

• Definition: “A parallel computer is a collection of processing
  elements that cooperate and communicate to solve large problems fast.”
  – Almasi and Gottlieb, Highly Parallel Computing, 1989
• Questions about parallel computers:
– How large a collection?
– How powerful are processing elements?
– How do they cooperate and communicate?
– How are data transmitted?
– What type of interconnection?
– What are HW and SW primitives for programmer?
– Does it translate into performance?

Parallel Processors “Religion”

• The dream of computer architects since the 1950s: replicate processors
  to add performance vs. design a faster processor
• Led to innovative organization tied to particular programming
models since “uniprocessors can’t keep going”
– e.g., uniprocessors must stop getting faster due to limit of speed of
light: 1972, … , 1989, NOW
– Borders religious fervor: you must believe!
– Fervor damped somewhat when 1990s companies went out of
business: Thinking Machines, Kendall Square, ...
• Arguments: “pull” of opportunity of scalable performance; the
“push” of uniprocessor performance plateau?

What Level Parallelism?

• Bit-level parallelism: 1970 to ~1985
  – 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction-level parallelism (ILP): ~1985 through today
– Pipelining
– Superscalar
– VLIW
– Out-of-Order execution
– Limits to benefits of ILP?
• Process Level or Thread level parallelism; mainstream for general purpose
computing?
– Servers are parallel
– Many PCs too!

Why Multiprocessors?

• Microprocessors are the fastest CPUs – leverage them
  – Collecting several is much easier than redesigning one
• Complexity of current microprocessors
– Do we have enough ideas to sustain 1.5X/yr?
– Can we deliver such complexity on schedule?
• Slow (but steady) improvement in parallel software (scientific
apps, databases, OS)
• Emergence of embedded and server markets driving
microprocessors in addition to desktops
– Embedded functional parallelism, producer/consumer model
– Server figure of merit is tasks per hour vs. latency
• Long term goal: scale number of processors to size of budget and
desired performance

Parallel Architecture

• Parallel architecture extends traditional computer architecture with a
  communication architecture
– Abstractions (HW/SW interface)
– Organizational structure to realize abstraction efficiently

Performance Metrics: Latency and Bandwidth

• Bandwidth
– Need high bandwidth in communication
– Match limits in network, memory, and processor
– Challenge is link speed of network interface vs. bisection bandwidth of
network
• Latency
– Affects performance, since processor may have to wait
– Affects ease of programming, since it requires more thought to overlap
communication and computation
– Overhead to communicate is a problem in many machines
• Latency Hiding
– How can a mechanism help hide latency?
– Increases programming system burden
– Examples: overlap message send with computation, prefetch data, switch
to other tasks

Popular Flynn Categories (1966)

• SISD (Single Instruction Stream, Single Data Stream)
  – Uniprocessors
• SIMD (Single Instruction Stream, Multiple Data Stream)
  – Single instruction memory and control processor (e.g., Illiac-IV,
    CM-2, vector machines)
• Simple programming model
• Low overhead
• Requires lots of data parallelism
• All custom integrated circuits
• MISD (Multiple Instruction Stream, Single Data Stream)
– No commercial multiprocessor of this type has been built
• MIMD (Multiple Instruction Stream, Multiple Data Stream)
– Examples: IBM SP, Cray T3D, SGI Origin, your PC?
• Flexible
• Use off-the-shelf microprocessors
• MIMD is the current winner: the major design emphasis is on MIMD
  machines with <= 128 processors

MIMD: Centralized Shared Memory

• UMA: Uniform Memory Access (latency)
• Symmetric (shared-memory) Multi-Processor (SMP)

MIMD: Distributed Memory

• Advantages: More memory bandwidth (low cost), lower memory latency (local)
• Drawback: Longer communication latency, More complex software model

Distributed Memory Versions

• Distributed Shared Memory (DSM)
  – Single physical address space with load/store access
• Address space is shared among nodes
– Also called Non-Uniform Memory Access (NUMA)
• Access time depends on the location of the data word in memory
• Message-Passing Multicomputer
– Separate address space per processor
• The same address on different nodes refers to different words
– Could be completely separated computers connected on LAN
• Also called clusters
– Tightly-coupled cluster

Layers of Parallel Framework

• Programming Model:
– Multiprogramming : lots of jobs, no communications
– Shared memory: communicate via memory
– Message passing: send and receive messages
– Data Parallel: several agents operate on several data sets
simultaneously and then exchange information globally and
simultaneously (shared or message passing)
• Communication Abstraction:
– Shared address space, e.g., load, store, atomic swap
– Message passing, e.g., send, receive library calls
– Debate over this topic (ease of programming, scaling) => many hardware
  designs map 1:1 to a programming model

Data Parallel Model

• Operations can be performed in parallel on each element of a large
  regular data structure, such as an array
• 1 control processor broadcasts to many PEs
  – When computers were large, could amortize the control portion over
    many replicated PEs
• Condition flag per PE so that PE can be skipped
• Data distributed in each memory
• Early 1980s VLSI => SIMD rebirth: 32 1-bit procs + memory on
a chip was the PE
• Data parallel programming languages lay out data to processors

Communication Mechanisms

• Shared memory: memory load and store
  – Oldest, but most popular model
• Message passing: Send and receive messages
– Sending messages that request action or deliver data
– Remote procedure call (RPC)
• Request the data and/or operations performed on data
• Synchronous: Request, wait, and continue
• Three performance metrics
– Communication bandwidth
– Communication latency
– Communication latency hiding
• Overlap communication with computation or with other
communication

Shared Memory Model

• Communicate via Load and Store
• Based on timesharing: processes on multiple processors vs. sharing a
  single processor
• Process: a virtual address space and one or more threads of control
– Multiple processes can overlap (share), but ALL threads share a
process address space
– Threads may be used in a casual way to refer to multiple loci of
execution, even when they do not share an address space
• Writes to shared address space by one thread are visible to reads
of other threads
– Usual model: share code, private stack, some shared heap, some
private heap

Message Passing Model

• Whole computers (CPU, memory, I/O devices) communicate as explicit I/O
  operations
• Send specifies local buffer + receiving process on remote
computer
• Receive specifies sending process on remote computer + local
buffer to place data
– Usually send includes process tag and receive has rule on tag (i.e.
match 1, match any)
– Synchronize: When send completes? When buffer free? When
request accepted? Receive wait for send?
• Send+receive => memory-memory copy, where each side supplies a local
  address, AND does pairwise synchronization! (see the MPI sketch below)
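
To make the send/receive abstraction concrete, here is a minimal sketch
using the MPI library (illustrative and not from the original slides;
the value and the tag are arbitrary):

    /* Process 0 sends one integer to process 1, using a tag. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* Send specifies local buffer + receiving process + tag. */
            MPI_Send(&value, 1, MPI_INT, 1, /*tag=*/0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive specifies sending process + local buffer; the tag
             * rule here is "match tag 0" (MPI_ANY_TAG matches any). */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("P1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }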

Advantages of Shared-Memory Communication

• Compatibility
– Oldest, but the most popular model
• Ease of programming
  – Especially when communication patterns among processors are complex
    and dynamic
  – Simplifies compilers
• Use familiar shared-memory model to develop applications
– Attention only on performance critical accesses
• Lower overhead for communications and better use of bandwidth
when communicating small items
• Can use hardware-controlled caching to reduce the frequency of
remote communication

Advantages of Message-Passing Communication

• Hardware is simpler
• Communication is explicit
– It is easier to understand (when, cost, etc.)
– In shared memory it can be hard to know when communicating and
when not, and how costly it is
• Forces the programmer to focus on communication, the costly aspect of
  parallel computing
• Synchronization is naturally associated with sending messages
– Reducing the possibility for errors introduced by incorrect
synchronization
• Easier to use sender-initiated communication
– May have some advantages in performance

Communication Options

• Supporting message passing on top of shared memory
  – Sending a message == copying data
    • Data may be misaligned though
• Supporting shared memory on top of hardware for message passing
  – The cost of communication is too high
    • Loads/stores usually move small amounts of data
– Shared virtual memory may be a possible direction
• What’s the best choice for massively parallel processors (MPP)?
– Message passing is clearly simpler
– Shared-memory has been widely supported since 1995

Amdahl’s Law and Parallel Computers

• Amdahl’s Law:

      Speedup = 1 / (FracX / SpeedupX + (1 − FracX))

• The sequential portion limits parallel speedup
  – Speedup <= 1 / (1 − FracX)
• The large latency of remote access is another major challenge

Example: What fraction sequential to get 80x speedup from 100 processors?
Assume either 1 processor or all 100 fully used.
    80 = 1 / (FracX/100 + (1 − FracX))
    0.8 · FracX + 80 · (1 − FracX) = 80 − 79.2 · FracX = 1
    FracX = (80 − 1) / 79.2 = 0.9975
Only 0.25% sequential!
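
The same calculation as a small C helper (a sketch, not part of the
original slides):

    #include <stdio.h>

    /* Amdahl's Law: speedup given the enhanced fraction and its speedup. */
    static double amdahl(double frac_x, double speedup_x) {
        return 1.0 / (frac_x / speedup_x + (1.0 - frac_x));
    }

    int main(void) {
        /* The example above: FracX = 0.9975, 100 processors => ~80x. */
        printf("speedup = %.1f\n", amdahl(0.9975, 100.0));
        return 0;
    }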

But In The Real World...

• Wood and Hill [1995] compared the cost-effectiveness of multiprocessor
  systems
– Challenge DM (1 proc, 6GB memory)
• costDM(1, MB) = $38,400 + $100 × MB
• costDM(1, 1GB) = $140,800
– Challenge XL (32 proc)
• costXL(p, MB) = $81,600 + $20,000 × p + $100 × MB
• costXL(8, 1GB) = $344,000
= 2.5x costDM(1,1GB)
– So, XL speedup of 2.5x or more makes the XL more cost effective!

Memory System Coherence

• Cache coherence is a challenge for hardware
• A memory system is coherent if any read of a data item returns the
  most recently written value of that data item
– Coherence: What value can be returned by a read
• Defines the behavior of reads and writes with respect to the same
memory location
• Ensures the memory system is correct.
– Consistency: When written values will be returned by reads (all
locations)
• Defines the behavior of accesses with respect to accesses to other
memory locations

Multiprocessor Coherence Rules

• A memory system is coherent if:
  – Processor P writes X, no other writes to X intervene, then processor
    P reads X: the value read must be the value P wrote
  – Processor Q writes X, sufficient time passes, no other writes to X
    intervene, then processor P reads X: the value read must be the
    value Q wrote
  – Processor P writes X and processor Q writes X: all processors see
    the writes in the same order (as P, Q or as Q, P, but the order must
    be the same for all processors)
    • Writes to the same location are serialized

Potential HW Coherency Solutions

• Snooping Solution (Snoopy Bus):
  – Send all requests for data to all processors
– Processors snoop to see if they have a copy and respond accordingly
– Requires broadcast, since caching information is at processors
– Works well with bus (natural broadcast medium)
– Dominates for small scale machines (most of the market)
• Directory-Based Schemes (discussed later)
– Keep track of what is being shared in a centralized place (logically)
– Distributed memory => distributed directory for scalability
(avoids bottlenecks)
– Send point-to-point requests to processors via network
– Scales better than Snooping
– Actually existed BEFORE Snooping-based schemes

Bus Snooping Topology

• Memory: centralized with uniform access time (“UMA”) and bus interconnect
• Symmetric Multiprocessor (SMP)

Basic Snooping Protocols

• Write Invalidate Protocol:
  – Multiple readers, single writer
– Write to shared data: an invalidate is sent to all caches which
snoop and invalidate any copies
– Read Miss:
• Write-through: memory is always up-to-date
• Write-back: snoop in caches to find most recent copy
• Write Broadcast/Update Protocol (typically write through):
– Write to shared data: broadcast on bus, processors snoop, and
update any copies
– Read miss: memory is always up-to-date
• Write serialization: bus serializes requests!
– Bus is single point of arbitration

Basic Snooping Protocols

• Write Invalidate versus Broadcast:
  – Invalidate requires one transaction per write run
  – Invalidate exploits spatial locality: one invalidation covers a
    whole block
  – Broadcast has lower latency between write and read

An Example Snooping Protocol

• Invalidation protocol, write-back cache
• Each block of memory is in one state:
  – Clean in all caches and up-to-date in memory (Shared)
  – OR Dirty in exactly one cache (Exclusive)
  – OR Not in any caches (Invalid)
• Each cache block is in one state (track these):
  – Shared: block can be read
  – OR Exclusive: cache has the only copy; it is writeable and dirty
  – OR Invalid: block contains no data
• Read misses: cause all caches to snoop bus
• Writes to clean line are treated as misses

Snoopy-Cache State Machine-I

• State machine for CPU requests, for each cache block:
  – Invalid, CPU read: place read miss on bus; go to Shared (read only)
  – Invalid, CPU write: place write miss on bus; go to Exclusive
    (read/write)
  – Shared, CPU read hit: stay in Shared
  – Shared, CPU read miss: place read miss on bus; stay in Shared
  – Shared, CPU write: place write miss on bus; go to Exclusive
  – Exclusive, CPU read hit or CPU write hit: stay in Exclusive
  – Exclusive, CPU read miss: write back block, place read miss on bus;
    go to Shared
  – Exclusive, CPU write miss: write back cache block, place write miss
    on bus; stay in Exclusive
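
These CPU-request transitions can be sketched as a small C function
(illustrative only; the names are mine, and bus actions are reduced to
printed messages):

    #include <stdio.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
    typedef enum { CPU_READ, CPU_WRITE } CpuOp;

    /* One step of the CPU-request state machine for a single cache block.
     * `hit` says whether the access hits the currently cached address. */
    static BlockState cpu_request(BlockState s, CpuOp op, int hit) {
        switch (s) {
        case INVALID:
            puts(op == CPU_READ ? "place read miss on bus"
                                : "place write miss on bus");
            return op == CPU_READ ? SHARED : EXCLUSIVE;
        case SHARED:
            if (op == CPU_READ) {
                if (!hit) puts("place read miss on bus");
                return SHARED;               /* read hit or miss: stay Shared */
            }
            puts("place write miss on bus"); /* upgrade on write */
            return EXCLUSIVE;
        case EXCLUSIVE:
            if (hit) return EXCLUSIVE;       /* read/write hit: no bus action */
            puts("write back block");        /* miss to a different address */
            puts(op == CPU_READ ? "place read miss on bus"
                                : "place write miss on bus");
            return op == CPU_READ ? SHARED : EXCLUSIVE;
        }
        return s;
    }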

Snoopy-Cache State Machine-II

• State machine for bus requests, for each cache block:
  – Shared, write miss on bus for this block: go to Invalid
  – Shared, read miss on bus for this block: stay in Shared
  – Exclusive, write miss on bus for this block: write back block (abort
    memory access); go to Invalid
  – Exclusive, read miss on bus for this block: write back block (abort
    memory access); go to Shared

Example

• Both caches start in the Invalid state; A1 and A2 map to the same
  cache block, but A1 != A2. The sequence of events:
  P1: Write 10 to A1
  P1: Read A1
  P2: Read A1
  P2: Write 20 to A1
  P2: Write 40 to A2
• Each step below lists the P1 and P2 cache contents (State, Addr,
  Value), the bus transaction (Action, Proc., Addr, Value), and memory
  contents (Addr, Value).

Example: Step 1

P1: Write 10 to A1    P1: Excl. A1 10                   Bus: WrMs P1 A1
• P1's write misses: a write miss goes on the bus and P1 installs the
  block in the Exclusive state.

Example: Step 2

P1: Read A1           P1: Excl. A1 10                   (read hit, no bus traffic)
• P1 reads its own Exclusive copy: a hit, with no bus transaction.

Example: Step 3

P2: Read A1           P2: Shar. A1                      Bus: RdMs P2 A1
                      P1: Shar. A1 10                   Bus: WrBk P1 A1 10    Mem: A1 10
                      P2: Shar. A1 10                   Bus: RdDa P2 A1 10    Mem: A1 10
• P2's read miss makes owner P1 write the block back (memory now holds
  10), and both caches end up in the Shared state.

Example: Step 4

P2: Write 20 to A1    P1: Inv.   P2: Excl. A1 20        Bus: WrMs P2 A1       Mem: A1 10
• P2's write miss invalidates P1's copy; P2 becomes Exclusive with value
  20 while memory still holds the stale 10.

Example: Step 5

P2: Write 40 to A2                                      Bus: WrMs P2 A2       Mem: A1 10
                      P2: Excl. A2 40                   Bus: WrBk P2 A1 20    Mem: A1 20
• A2 maps to the same cache block, so P2 first writes back A1 = 20
  (updating memory) and then installs A2 in the Exclusive state.

False Sharing

• P1 and P2 repeatedly write two adjacent words that fall in the same
  cache block:

    /* P1 */                     /* P2 */
    for (i = 0; ; i++) {         for (j = 0; ; j++) {
        ...                          ...
        *a = i;                      *(a + 1) = j;
        ...                          ...
    }                            }

• The block ping-pongs between the two caches even though the processors
  never touch the same word – false sharing.
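
A common remedy, sketched below (not from the slides; the 64-byte block
size is an assumption), is to align each processor's data to its own
cache block:

    /* Sketch: avoid false sharing by giving each counter its own cache
     * block. 64 is an assumed block size; alignas is C11. */
    #include <stdalign.h>

    struct counters {
        alignas(64) long p1_count;   /* written only by P1 */
        alignas(64) long p2_count;   /* written only by P2 */
    };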

False Sharing Quiz

• x1 and x2 share a cache block, which starts in the shared state in
  both caches. For each access, is the resulting miss a true sharing
  miss or a false sharing miss?

  Time   P1           P2           True or False?
  1      Write x1                  ?
  2                   Read x2      ?
  3      Write x1                  ?
  4                   Write x2     ?
  5      Read x2                   ?

False Sharing Quiz: Answers

  Time   P1           P2           True or False?
  1      Write x1                  True
  2                   Read x2      False
  3      Write x1                  False
  4                   Write x2     False
  5      Read x2                   True

Larger MPs

• Separate memory per processor
• Local or remote access via memory controller
• One cache coherency solution: non-cached pages
• Alternative: directory per cache that tracks state of every block in
  every cache
  – Which caches have copies of the block, dirty vs. clean, ...
• Info per memory block vs. per cache block?
  – PLUS: in memory => simpler protocol (centralized/one location)
  – MINUS: in memory => directory size is f(memory size) vs. f(cache size)
• Prevent the directory from becoming a bottleneck?
  – Distribute directory entries with memory, each keeping track of
    which processors have copies of their blocks

Distributed Directory MPs

Directory Protocol

• States identical to Snooping Protocol:
  – Shared: ≥ 1 processors have data; memory is up-to-date
– Invalid: (no processor has it; not valid in any cache)
– Exclusive: 1 processor (owner) has data; memory is out-of-date
• In addition to cache state, must track which processors have data
when in the shared state (usually a bit vector; 1 if processor has a
copy)
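
Such a directory entry might look like the following C sketch
(illustrative; the field names and the 32-processor limit are
assumptions):

    #include <stdint.h>

    /* Per-memory-block directory entry: protocol state plus a sharer
     * bit vector (bit p set if processor p has a copy). <= 32 CPUs. */
    typedef enum { DIR_INVALID, DIR_SHARED, DIR_EXCLUSIVE } DirState;

    typedef struct {
        DirState state;
        uint32_t sharers;   /* bit vector of processors holding the block */
    } DirEntry;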

Directory-based cache coherence

• States identical to snooping case; transactions very similar.
• Three processors typically involved:
  – Local node: where a request originates
  – Home node: where the memory location of an address resides
  – Remote node: has a copy of a cache block, whether exclusive or shared
• Generates read miss & write miss message to home directory.
• Write misses that were broadcast on the bus for snooping =>
explicit invalidate & data fetch requests.

CPU-Cache State Machine

• State machine for CPU requests, for each memory block (Invalid state
  if the block is only in memory):
  – Invalid, CPU read: send Read Miss message to home directory; go to
    Shared (read only)
  – Invalid, CPU write: send Write Miss message to home directory; go to
    Exclusive (read/write)
  – Shared, CPU read hit: stay in Shared
  – Shared, CPU read miss: send Read Miss message to home directory
  – Shared, CPU write: send Write Miss message to home directory; go to
    Exclusive
  – Shared, Invalidate from home directory: go to Invalid
  – Exclusive, CPU read hit or CPU write hit: stay in Exclusive
  – Exclusive, Fetch from home directory: send Data Write Back message
    to home directory; go to Shared
  – Exclusive, Fetch/Invalidate from home directory: send Data Write
    Back message to home directory; go to Invalid
  – Exclusive, CPU read miss: send Data Write Back message and Read Miss
    to home directory; go to Shared
  – Exclusive, CPU write miss: send Data Write Back message and Write
    Miss to home directory
Directory State Machine

• State machine for directory requests, for each memory block:
  – Invalid, Read miss: Sharers = {P}; send Data Value Reply; go to
    Shared (read only)
  – Invalid, Write miss: Sharers = {P}; send Data Value Reply message;
    go to Exclusive (read/write)
  – Shared, Read miss: Sharers += {P}; send Data Value Reply
  – Shared, Write miss: send Invalidate to Sharers; then Sharers = {P};
    send Data Value Reply message; go to Exclusive
  – Exclusive, Data Write Back: Sharers = {} (write back block); go to
    Invalid
  – Exclusive, Read miss: Sharers += {P}; send Fetch; send Data Value
    Reply message to remote cache (write back block); go to Shared
  – Exclusive, Write miss: Sharers = {P}; send Fetch/Invalidate; send
    Data Value Reply message to remote cache (write back block)
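
Following the transition list above, the directory's reaction to a read
miss can be sketched in C (illustrative; message sends are reduced to
printed strings, and the entry layout repeats the sketch given earlier):

    #include <stdio.h>
    #include <stdint.h>

    typedef enum { DIR_INVALID, DIR_SHARED, DIR_EXCLUSIVE } DirState;
    typedef struct { DirState state; uint32_t sharers; } DirEntry;

    /* Directory reaction to a read miss from processor p. */
    static void dir_read_miss(DirEntry *e, int p) {
        switch (e->state) {
        case DIR_INVALID:                 /* memory holds the only copy */
            e->sharers = 1u << p;
            e->state = DIR_SHARED;
            printf("send data value reply to P%d\n", p);
            break;
        case DIR_SHARED:                  /* just add another sharer */
            e->sharers |= 1u << p;
            printf("send data value reply to P%d\n", p);
            break;
        case DIR_EXCLUSIVE:               /* fetch block back from owner */
            printf("send fetch to owner; owner writes block back\n");
            e->sharers |= 1u << p;        /* owner stays a sharer */
            e->state = DIR_SHARED;
            printf("send data value reply to P%d\n", p);
            break;
        }
    }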

Example

• The same sequence of events, now under the directory protocol. Both
  caches start Invalid; A1 and A2 map to the same cache block:
  P1: Write 10 to A1
  P1: Read A1
  P2: Read A1
  P2: Write 20 to A1
  P2: Write 40 to A2
• Each step below lists the P1 and P2 cache contents (State, Addr,
  Value), the interconnect message (Action, Proc., Addr, Value), and the
  directory state (Addr, State, {Procs}, Value).

Example: Step 1

P1: Write 10 to A1                       Msg: WrMs P1 A1      Dir: A1 Ex {P1}
                      P1: Excl. A1 10    Msg: DaRp P1 A1 0
• P1's write miss goes to the home directory, which records P1 as the
  exclusive owner and replies with the data (value 0, which P1 then
  overwrites with 10).

Example: Step 2

P1: Read A1           P1: Excl. A1 10    (read hit, no messages)

Example: Step 3

P2: Read A1           P2: Shar. A1       Msg: RdMs P2 A1
                      P1: Shar. A1 10    Msg: Ftch P1 A1 10   Dir: A1 10
                      P2: Shar. A1 10    Msg: DaRp P2 A1 10   Dir: A1 Shar. {P1,P2} 10
• The directory fetches the block from owner P1 (a write back), adds P2
  to the sharer set, and replies with the data.

Example: Step 4

P2: Write 20 to A1    P2: Excl. A1 20    Msg: WrMs P2 A1
                      P1: Inv.           Msg: Inval. P1 A1    Dir: A1 Excl. {P2} 10
• The directory invalidates P1's copy and records P2 as the exclusive
  owner; memory still holds the stale 10.

Example: Step 5

P2: Write 40 to A2                       Msg: WrMs P2 A2      Dir: A2 Excl. {P2} 0
                                         Msg: WrBk P2 A1 20   Dir: A1 Unca. {} 20
                      P2: Excl. A2 40    Msg: DaRp P2 A2 0    Dir: A2 Excl. {P2} 0
• A2 maps to the same cache block, so P2 writes back A1 = 20 (the
  directory marks A1 uncached and memory is updated) before installing
  A2 exclusively.

Implementation issues with a snoopy protocol

• Bus request and bus acquisition occur in two steps
• Four possible events can occur in between:
– Write miss on the bus by another processor
– Read miss on the bus by another processor
– CPU read miss
– CPU write miss
• Interventions: Cache overriding memory

Snoopy-Cache State Machine-I (repeated)

• State machine for CPU requests, for each cache block – see the
  transition list given earlier.

Implementation issues with a directory-based protocol

• Assumptions about networks
  – In-order delivery between a pair of nodes
  – Messages can always be injected
  – Network-level deadlock freedom
• To ensure system-level deadlock freedom
  – Separate networks for requests and replies
  – Allocate space for replies upon sending out requests
  – Requests can be NACKed, but not replies
  – Retry NACKed requests

Memory Consistency

• Adve and Gharachorloo, “Shared Memory Consistency Models: A Tutorial”,
  WRL Research Report 95/7.

Consistency vs. Coherency: Recap

• Coherence concerns the same memory location; consistency concerns
  accesses to different memory locations
• Consistency models are needed regardless of whether shared data is
  cached

Memory Consistency Models

• Cache coherence ensures multiple processors see a consistent view of
  memory, but how consistent?
  – When must a processor see a value written by another processor?
  – In what order must a processor observe the writes of another
    processor?
Example:
P1: A = 0; P2: B = 0;
..... .....
A = 1; B = 1;
L1: if (B == 0) ... L2: if (A == 0) ...
• Impossible for both if statements L1 & L2 to be true?
– What if write invalidate is delayed & processor continues?
– Should this behavior be allowed? Under what conditions?
• Memory consistency models:
what are the rules for such cases?

Sequential Consistency

• The result of any execution is the same as if
  – The operations of all the processors were executed in some
    sequential order, and
  – The operations of each individual processor appear in this sequence
    in the order specified by its program.

• Intuitive: maps well with uniprocessor

• Implementations
– Delay memory access completion until all invalidates are done
– Or, delay next memory access until previous one is done

• Severe performance restrictions

How uniprocessor hardware optimizations of memory system
break sequential consistency + solution {No caches}

• (RAW) Reads bypassing writes: write buffers
• Code example: the Dekker-style flag test shown in the animation below
• Solution: no buffering of writes – add delays

Animations of reads bypassing writes ☺

• Flag1 and Flag2 both start at 0 in memory; each processor's write is
  held in its write buffer while its read goes ahead to memory:
  1. P1: Flag1 = 1           (sits in P1's write buffer)
  2. P2: Flag2 = 1           (sits in P2's write buffer)
  3. P1: if (Flag2 == 0)     true!
  4. P2: if (Flag1 == 0)     true!
• Both reads bypass the buffered writes, so both conditions are true –
  an outcome sequential consistency forbids.
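
In modern terms, the “no buffering of writes” fix corresponds to
sequentially consistent atomics; a C11 sketch of P1's side (illustrative,
not from the slides):

    #include <stdatomic.h>

    atomic_int Flag1, Flag2;   /* both start at 0 */

    /* P1's side of the Dekker-style test. Sequentially consistent
     * atomics forbid the read of Flag2 from bypassing the (buffered)
     * write of Flag1; P2's side is symmetric. */
    void p1(void) {
        atomic_store_explicit(&Flag1, 1, memory_order_seq_cst);
        if (atomic_load_explicit(&Flag2, memory_order_seq_cst) == 0) {
            /* critical section: at most one of P1/P2 can get here */
        }
    }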

How uniprocessor optimizations of memory system break
sequential consistency + solution {No caches}

• (WAW) Write ordering
  – Overlapping writes in the network
  – Write merging into a cache line
• Code example: the Data/Head handoff shown in the animation below
• Solution: wait until a write has reached its memory module before
  allowing the next write – acknowledgments for writes – more delay

34
Animations of writes bypassing writes

• Data and Head start at 0 and live in different memory modules (M3
  holds Data, M4 holds Head). P1's two writes take different paths
  through the network and arrive out of order:
  1. P1: Data = 2000         (still in flight to M3)
  2. P1: Head = 1            (arrives at M4 first)
  3. P2: while (Head == 0);  exits – Head is already 1
  4. P2: ... = Data          reads Data = 0!
• P2 sees the new Head but the old Data, so the producer/consumer
  handoff breaks.

How uniprocessor optimizations of memory system break
sequential consistency {No caches}

• (RAR, WAR) Non-blocking reads
  – Speculative execution
• Code example: the same Data/Head handoff, shown in the animation below
• Solution: disable non-blocking reads – more delay

Animations of reads bypassing reads

• Data and Head start at 0. P2 issues the read of Data speculatively,
  before its read of Head completes:
  1. P1: Data = 2000
  2. P1: Head = 1
  3. P2: while (Head == 0);  the Head load misses and returns 0
  4. P2: ... = Data          the speculative read already returned Data = 0
• P2's read of Data effectively bypasses its read of Head, so it again
  observes the stale Data.
Now, let’s add caches.. More problems

• (WAW) P1's write to Data reaches memory, but has not yet been
  propagated to P2 (the cache coherence protocol's message (invalidate
  or update) has not yet arrived)
• Code example: the same Data/Head handoff, shown in the animation below
• Solution: wait for the acknowledgment of an update/invalidate before
  any subsequent write/read – an even longer delay than just waiting for
  the ack from the memory module

Animations of problems with a cache

• P2 starts with Data = 0 in its cache. P1's write of Data = 2000 has
  reached memory, but the coherence protocol's invalidate/update for
  P2's copy is still in flight:
  1. P1: Data = 2000         (memory now holds 2000; P2's copy is stale)
  2. P1: Head = 1            (reaches P2 first)
  3. P2: while (Head == 0);  exits – Head is 1
  4. P2: ... = Data          hits in its own cache: Data = 0!

Caches + network.. headaches

• Non-atomic writes – all processors have to see writes to the same
  location in the same order
• Code example: writes to A may appear at P3/P4 in different order due
  to different network paths (see the animation below)
• Solution: wait till all acks for a location are complete; or have the
  network preserve ordering between each node pair

Animations of problems with networks

• P1 writes A = 1 and P2 writes A = 2; the updates travel to P3 and P4
  over different network paths:
  – At P3 the write A = 1 arrives last: Register1 = A = 1
  – At P4 the write A = 2 arrives last: Register2 = A = 2
• P3 and P4 see the two writes to A in different orders – writes to the
  same location are not serialized.

Summary of sequential consistency

• Slow, because
– In each processor, previous memory operation in program order
must be completed before proceeding with next memory operation
– In cache-coherent systems,
• writes to the same location must be visible in the same order to all
processors; and
• no subsequent reads till write is visible to all processors.
• Mitigating solutions
– Prefetch cache ownership while waiting in write buffer
– Speculate and rollback
– Compiler memory analysis – reorder at hw/sw only if safe

Relaxed (Weak) Consistency Models

• Several relaxed models for memory consistency, characterized by their
  attitude towards RAR, WAR, RAW, and WAW orderings to different
  addresses
• Not really an issue for most programs – they are synchronized
  – A program is synchronized if all accesses to shared data are ordered
    by synchronization operations, e.g.:
– write (x)
...
release (s) {unlock}
...
acquire (s) {lock}
...
read(x)
• Only those programs willing to be nondeterministic are not
synchronized, leading to data races
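
The synchronized pattern above maps directly onto release/acquire
operations; a C11 sketch (illustrative, not from the slides):

    #include <stdatomic.h>

    int x;          /* ordinary shared data */
    atomic_int s;   /* synchronization flag, starts at 0 */

    void writer(void) {
        x = 42;                                             /* write(x) */
        atomic_store_explicit(&s, 1, memory_order_release); /* release(s) */
    }

    void reader(void) {
        while (atomic_load_explicit(&s, memory_order_acquire) == 0)
            ;                                               /* acquire(s) */
        /* The release/acquire pair orders the accesses: x is 42 here. */
        int r = x;                                          /* read(x) */
        (void)r;
    }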

Synchronization

• Hardware needs to know how the programmer intends different processes
  to use shared data
• Issues for synchronization:
– Hardware needs to support uninterruptable instructions to fetch and
update memory (atomic operation);
– User level synchronization operation using this primitive;
– For large scale MPs, synchronization can be a bottleneck;
techniques to reduce contention and latency of synchronization

Uninterruptable Instruction to Fetch and Update Memory

• Atomic exchange: interchange a value in a register for a value in
  memory
  – 0 => synchronization variable is free
  – 1 => synchronization variable is locked and unavailable
  – Set register to 1 and swap
  – New value in register determines success in getting the lock:
    0 if you succeeded in setting the lock (you were first)
    1 if another processor had already claimed access
– Key is that exchange operation is indivisible
• Test-and-set: tests a value and sets it if the value passes the test
• Fetch-and-increment: returns the value of a memory location and
atomically increments it
– 0 => synchronization variable is free

Load Linked, Store Conditional

• Hard to have read and write in one instruction; use two instead
• Load linked (or load locked) + store conditional
– Load linked returns the initial value
– Store conditional returns 1 if it succeeds (no other store to the same
  memory location since the preceding load) and 0 otherwise
• Example doing atomic swap with LL & SC:
try: mov R3,R4 ; mov exchange value
ll R2,0(R1) ; load linked
sc R3,0(R1) ; store conditional
beqz R3,try ; branch store fails (R3 = 0)
mov R4,R2 ; put load value in R4
• Example doing fetch & increment with LL & SC:
try: ll R2,0(R1) ; load linked
addi R2,R2,#1 ; increment (OK if reg–reg)
sc R2,0(R1) ; store conditional
beqz R2,try ; branch store fails (R2 = 0)
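
On machines without LL/SC, the same fetch-and-increment is commonly
built from a compare-and-swap retry loop; a C11 sketch (illustrative):

    #include <stdatomic.h>

    /* Fetch-and-increment via a CAS retry loop: the compare-exchange
     * plays the role of the store conditional, failing if another store
     * slipped in between the load and the update. */
    int fetch_and_increment(atomic_int *p) {
        int old = atomic_load(p);
        while (!atomic_compare_exchange_weak(p, &old, old + 1))
            ;   /* on failure, `old` is reloaded with the current value */
        return old;
    }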

User-Level Synchronization

• Spin locks: processor continuously tries to acquire, spinning around a
  loop trying to get the lock
li R2,#1
lockit: exch R2,0(R1) ;atomic exchange
bnez R2,lockit ;already locked?

• What about MP with cache coherency?
– Want to spin on cache copy to avoid full memory latency
– Likely to get cache hits for such variables
• Problem: exchange includes a write, which invalidates all other copies; this
generates considerable bus traffic
• Solution: start by simply repeatedly reading the variable; when it is free, then
try exchange (“test and test&set”):
try: li R2,#1
lockit: lw R3,0(R1) ;load var
bnez R3,lockit ;not free=>spin
exch R2,0(R1) ;atomic exchange
bnez R2,try ;already locked?

Scalable Synchronization

• Exponential backoff
  – If you fail to get the lock, don't try again for an exponentially
    increasing amount of time (see the sketch after this list)
• Queuing lock
– Wait for lock by getting on the queue. Lock is passed down the
queue.
– Can be implemented in hardware or software
• Combining tree
– Implement barriers as a tree to avoid contention
• Powerful atomic primitives such as fetch-and-increment
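
A test-and-test&set lock with exponential backoff might be sketched in
C11 as follows (illustrative; the delay constants are arbitrary):

    #include <stdatomic.h>

    atomic_int lock_var;   /* 0 = free, 1 = locked */

    void acquire(void) {
        unsigned delay = 1;
        for (;;) {
            /* Spin on a plain read first (test) to hit in our cache... */
            while (atomic_load(&lock_var) != 0)
                ;
            /* ...then try the atomic exchange (test&set). */
            if (atomic_exchange(&lock_var, 1) == 0)
                return;                    /* got the lock */
            /* Failed: back off for an exponentially growing interval. */
            for (volatile unsigned i = 0; i < delay; i++)
                ;                          /* crude delay loop */
            if (delay < (1u << 16))
                delay *= 2;
        }
    }

    void release(void) {
        atomic_store(&lock_var, 0);
    }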

Summary

• Caches contain all information on state of cached memory blocks
• Snooping and directory protocols are similar; the bus makes snooping
  easier because of broadcast (snooping => uniform memory access)
• Directory has extra data structure to keep track of state of all
cache blocks
• Distributing directory => scalable shared memory multiprocessor
=> Cache coherent, Non uniform memory access (cc-NUMA)
• Practical implementations of shared memory multiprocessors are
not sequentially consistent due to performance reasons –
programmers use synchronization to ensure correctness

IBM Power4

Case Study: IBM Power4 Caches

Interconnection network of buses

Case Study: IBM Power4: Cache States

• L1 cache: Writethrough
• L2 cache: Maintains coherence (invalidation-based), Writeback,
Inclusive of L1
• L3 cache: Writeback, non-inclusive, maintains coherence
(invalidation-based)
• Snooping protocol
– Snoops on all buses.

IBM Power4’s L2 coherence protocol

• I (Invalid state)
• SL (Shared state, in one core within the module)
• S (Shared state, in multiple cores within a module)
• M (Modified state, and exclusive)
• Me (Data is not modified, but exclusive, no write back
necessary)
• Mu [for load-linked lwarx / store-conditional stwcx.]
• T (tagged: data modified, but not exclusively owned)
– Optimization: Do not write back modified exclusive data upon a
read hit

IBM Power4’s L2 coherence protocol

IBM Power4’s L3 coherence protocol

• States:
• Invalid
• Shared: Can only give data to its own L2s that it’s caching data
for
• Tagged: Modified and shared (local memory)
• Tagged remote: Modified and shared (remote memory)
• O (prefetch): Data in L3 = Memory, and same node, not
modified.

IBM Blue Gene/L memory architecture

Node overview

Single chip

