Rajeshwari B
Department of Electronics and Communication Engineering
COMPUTER ARCHITECTURE
Thread-Level Parallelism
UNIT 5
Shared-Memory Multiprocessors
Symmetric multiprocessors (SMP)
– A centralized memory
– Small number of cores (< 32)
– All cores share a single memory with uniform memory latency
SMP becomes less attractive as the number of processors sharing the centralized memory increases.
COMPUTER ARCHITECTURE
Thus, to achieve a speedup of 80 with 100 processors, only about 0.25% of the original computation can be sequential.
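The 0.25% figure follows from solving Amdahl's law for the sequential fraction. A minimal sketch (function name is illustrative):

```python
def max_sequential_fraction(target_speedup, n_processors):
    """Solve Amdahl's law, speedup = 1 / (f_seq + (1 - f_seq) / n),
    for the sequential fraction f_seq."""
    return (1 / target_speedup - 1 / n_processors) / (1 - 1 / n_processors)

f_seq = max_sequential_fraction(80, 100)
print(f"sequential fraction = {f_seq:.4%}")  # about 0.25%
```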
Challenges of Parallel Processing
• The second challenge is the relatively high cost of communication.
Example: Suppose a 32-core MP runs at 4 GHz, a remote memory access takes 100 ns, all local accesses hit in the memory hierarchy, and the base CPI is 0.5. What is the performance impact if 0.2% of instructions involve a remote access?
• Answer:
– Remote access cost = 100 ns / 0.25 ns per cycle = 400 clock cycles.
– CPI = Base CPI + Remote request rate × Remote request cost
– CPI = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3
– The MP with all local references (no communication) is 1.3/0.5 = 2.6 times faster than when 0.2% of instructions involve a remote access.
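The arithmetic above can be checked directly (a small sketch using the example's numbers):

```python
clock_ghz = 4.0
cycle_ns = 1 / clock_ghz            # 0.25 ns per cycle at 4 GHz
remote_cycles = 100 / cycle_ns      # 100 ns remote access = 400 cycles
base_cpi = 0.5
remote_rate = 0.002                 # 0.2% of instructions

effective_cpi = base_cpi + remote_rate * remote_cycles
slowdown = effective_cpi / base_cpi
print(effective_cpi, slowdown)      # ~1.3 and ~2.6
```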
1. Application parallelism -> addressed primarily via new algorithms that have better parallel performance
2. Long remote latency impact -> addressed both by the architect and by the programmer
Much of this lecture focuses on techniques for reducing the impact of long remote latency:
– How can caching be used to reduce remote access frequency while maintaining a coherent view of memory?
– How can synchronization be made efficient?
– What latency-hiding techniques can be used?
3. Write serialization: two writes to the same location by any two processors are seen in the same order by all processors.
– If not, a processor could keep the value 1 indefinitely because it saw it as the last write.
– For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.
A Write-Back Snoopy Protocol
• Invalidation protocol, write-back caches
– Each cache snoops every address placed on the bus
– If a cache has a dirty copy of the requested block, it provides that block in response to the read request and aborts the memory access
• Each memory block is in one of three states:
– Clean in all caches and up-to-date in memory (Shared)
– OR Dirty in exactly one cache (Exclusive)
– OR Not in any cache (Invalid)
• Each cache block is in one of three states (the cache tracks these):
– Shared: the block can be read
– OR Exclusive: this cache has the only copy; it is writable and dirty
– OR Invalid: the block contains no data (as in a uniprocessor cache)
• Read misses: cause all caches to snoop the bus
• Writes to clean blocks are treated as misses
Exercise
Four processors share memory using the UMA model. Apply the snoopy-based protocol to the following sequence of memory operations, writing the state of the cache blocks after each operation.
P1 reads X
P4 reads X
P3 writes to X
P2 reads X
P1 writes to X
P2 writes to X
Where X is a block in main memory
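The exercise can be traced with a small simulator (a Python sketch; state names follow the slides: Shared, Exclusive = the dirty owner, Invalid):

```python
def simulate(ops, n_procs=4):
    """Trace an invalidation-based write-back snoopy protocol for one block.
    Per-cache states: 'I' (Invalid), 'S' (Shared), 'E' (Exclusive/dirty)."""
    state = {p: 'I' for p in range(1, n_procs + 1)}
    trace = []
    for who, op in ops:
        if op == 'R':
            if state[who] == 'I':          # read miss: snoop the bus
                for p in state:
                    if state[p] == 'E':    # dirty copy supplies the data,
                        state[p] = 'S'     # then downgrades to Shared
                state[who] = 'S'
        else:                              # 'W'
            if state[who] != 'E':          # write miss / write to clean block
                for p in state:
                    if p != who:
                        state[p] = 'I'     # invalidate all other copies
                state[who] = 'E'
        trace.append((f"P{who} {op} X", dict(state)))
    return trace

ops = [(1, 'R'), (4, 'R'), (3, 'W'), (2, 'R'), (1, 'W'), (2, 'W')]
for step, snapshot in simulate(ops):
    print(step, snapshot)
```

After the last operation, P2 holds the block Exclusive and all other copies are Invalid.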
Extensions to the Basic Coherence Protocol
• The basic protocol is a simple three-state MSI (Modified, Shared, Invalid) protocol.
• Add an exclusive state to indicate that a cache block is resident in only a single cache but is clean. (It can then be written without generating any invalidates.)
• Add an owned state to indicate that the main memory copy is out of date and that the designated cache is the owner.
– MOESI (Modified, Owned, Exclusive, Shared, Invalid) protocol, e.g., the AMD Opteron
Example:
Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Assuming the following sequence of events, identify the true and false sharing misses in each of the five steps.
4. This event is a false sharing miss for the same reason as step 3.
5. This event is a true sharing miss, since the value being read was written by P2.
1. Changing the cache block size, or the amount of the cache block covered by a given valid bit, are hardware changes.
2. Ensure that no non-shared objects are located in the same cache block frame as any shared object. If this is done, then even with just a single valid bit per cache block, false sharing is impossible.
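Mitigation 2 is what per-thread padding achieves in software. A toy sketch (the 64-byte line size and 8-byte counter size are assumptions for illustration):

```python
LINE = 64   # assumed cache-line size in bytes
ITEM = 8    # assumed per-thread 8-byte counter

def cache_line(byte_addr, line=LINE):
    """Which cache line a byte address falls on."""
    return byte_addr // line

# Unpadded layout: adjacent counters share one line -> false sharing
unpadded = [i * ITEM for i in range(4)]
# Padded layout: each counter starts on its own line
padded = [i * LINE for i in range(4)]

print([cache_line(a) for a in unpadded])  # [0, 0, 0, 0]
print([cache_line(a) for a in padded])    # [0, 1, 2, 3]
```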
Distributed Shared-Memory and Directory-Based Coherence
Alternative idea: avoid broadcast by storing information about the status of the
line in one place: a “directory”
- The directory entry for a cache line contains information about the
state of the cache line in all caches.
Directory Protocol
• Similar to the snoopy protocol: three states
– Shared: one or more processors have the data; memory is up-to-date
– Uncached: no processor has it; not valid in any cache
– Exclusive: exactly one processor (the owner) has the data; memory is out-of-date
• In addition to the cache state, the directory must track which processors have the data when in the Shared state (usually a bit vector: 1 if the processor has a copy)
• Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:
– Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
– Data write-back: the owner processor is replacing the block and hence must write it back. This makes the memory copy up-to-date (the home directory essentially becomes the owner), the block is now Uncached, and the Sharers set is empty.
– Write miss: the block has a new owner. A message is sent to the old owner, causing the cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
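The request types above can be sketched as a directory entry holding a state and a sharers set (a minimal Python model; it ignores message latency and races):

```python
class DirectoryEntry:
    """Minimal model of one memory block's directory entry."""
    def __init__(self):
        self.state = 'Uncached'
        self.sharers = set()   # processors with a copy (a bit vector in hardware)

    def read_miss(self, requester):
        if self.state == 'Exclusive':
            # Fetch from the owner; the owner keeps a readable (Shared) copy
            (owner,) = self.sharers
            self.sharers = {owner, requester}
        else:
            self.sharers.add(requester)
        self.state = 'Shared'

    def write_miss(self, requester):
        # Old owner and all sharers are invalidated; requester becomes sole owner
        self.sharers = {requester}
        self.state = 'Exclusive'

    def write_back(self, owner):
        # The owner replaces the block; memory becomes up-to-date
        self.sharers.discard(owner)
        self.state = 'Uncached'

entry = DirectoryEntry()
entry.read_miss(1); entry.read_miss(2)   # two Shared copies
entry.write_miss(3)                      # P3 becomes the owner
print(entry.state, entry.sharers)        # Exclusive {3}
```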
Example 1: read miss to clean line
Example 2: read miss to dirty line: read from main memory by processor 0 of the blue line; the line is dirty (its contents are in P2's cache)
▪ On reads, directory tells requesting node exactly where to get the line from
- Either from home node (if the line is clean)
- Or from the owning node (if the line is dirty)
- Either way, retrieving data involves only point-to-point communication
Load linked (also called load locked) is paired with a special store called store conditional. These instructions are used in sequence: if the contents of the memory location specified by the load linked are changed before the store conditional to the same address, the store conditional fails.
Atomic exchange on the memory location specified by
the contents of R1:
try: MOV R3,R4 ;mov exchange value
LL R2,0(R1) ;load linked
SC R3,0(R1) ;store conditional
BEQZ R3,try ;branch if store fails
MOV R4,R2 ;put load value in R4
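The failure condition can be modeled directly (a toy Python model with a single link register; this illustrates the semantics, not the hardware implementation):

```python
class LLSCMemory:
    """Toy model: any store to the linked address breaks the link."""
    def __init__(self):
        self.mem = {}
        self.link = None           # address currently "linked", or None

    def load_linked(self, addr):
        self.link = addr
        return self.mem.get(addr, 0)

    def store(self, addr, value):  # ordinary store (e.g., by another processor)
        if self.link == addr:
            self.link = None       # intervening write breaks the link
        self.mem[addr] = value

    def store_conditional(self, addr, value):
        if self.link != addr:
            return False           # SC fails; software retries from the LL
        self.mem[addr] = value
        self.link = None
        return True

m = LLSCMemory()
m.load_linked(0x100)
m.store(0x100, 7)                     # another processor writes in between
print(m.store_conditional(0x100, 1))  # False: the SC fails
m.load_linked(0x100)                  # retry, as the BEQZ loop above does
print(m.store_conditional(0x100, 1))  # True: the SC succeeds
```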
Answer: When we wait for invalidates, each write takes the sum of the
ownership time plus the time to complete the invalidates. Since the
invalidates can overlap, we need only worry about the last one, which
starts 10 + 10 + 10 + 10 = 40 cycles after ownership is established.
Hence, the total time for the write is 50 + 40 + 80 = 170 cycles.
In comparison, the ownership time is only 50 cycles. With appropriate
write buffer implementations, it is even possible to continue before
ownership is established.
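The cycle counts in this answer can be rechecked as follows (the parameters are inferred from the answer itself: 50-cycle ownership, four invalidates issued 10 cycles apart, and 80 cycles for the last invalidate to complete):

```python
ownership = 50       # cycles to establish ownership
issue = 10           # cycles between successive invalidate issues
n_invalidates = 4    # inferred from 10 + 10 + 10 + 10 = 40
complete = 80        # cycles for the last invalidate to complete

last_start = n_invalidates * issue         # last invalidate starts 40 cycles in
total = ownership + last_start + complete  # total write time
print(total)                               # 170
```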
Relaxed Consistency Models
Rules:
X → Y : operation X must complete before operation Y is done
Sequential consistency requires all four orderings:
R → W, R → R, W → R, W → W
Relaxing:
1. Relax W → R : "total store ordering"
2. Relax W → W : "partial store order"
3. Relax R → W and R → R : "weak ordering" and "release consistency"
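The effect of relaxing W → R can be seen on the classic store-buffering litmus test: under sequential consistency the outcome r1 = r2 = 0 is impossible, but total store ordering allows it. A small enumerator (Python sketch; TSO is modeled simply as optionally reordering each thread's W → R pair):

```python
def interleave(a, b):
    """All interleavings of two per-thread operation sequences."""
    if not a: yield b; return
    if not b: yield a; return
    for rest in interleave(a[1:], b):
        yield [a[0]] + rest
    for rest in interleave(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    mem, regs = {'x': 0, 'y': 0}, {}
    for op, loc, arg in schedule:
        if op == 'W': mem[loc] = arg
        else: regs[arg] = mem[loc]   # 'R': arg names the destination register
    return regs['r1'], regs['r2']

# Store-buffering test.  T1: x = 1; r1 = y    T2: y = 1; r2 = x
t1 = [('W', 'x', 1), ('R', 'y', 'r1')]
t2 = [('W', 'y', 1), ('R', 'x', 'r2')]

sc = {run(s) for s in interleave(t1, t2)}

# TSO: each thread may also run with its W -> R pair reordered
orders1, orders2 = [t1, t1[::-1]], [t2, t2[::-1]]
tso = {run(s) for a in orders1 for b in orders2 for s in interleave(a, b)}

print((0, 0) in sc)    # False: forbidden under sequential consistency
print((0, 0) in tso)   # True: allowed once W -> R is relaxed
```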
THANK YOU
Rajeshwari B
Department of Electronics and Communication
Engineering
rajeshwari@pes.edu