Rajeshwari B
Department of Electronics and Communication Engineering
COMPUTER ARCHITECTURE
Thread-Level Parallelism
UNIT 5
Shared-Memory Multiprocessors
Symmetric multiprocessors (SMP)
– A centralized memory
– Small number of cores (< 32)
– All cores share a single memory with uniform memory latency
SMP becomes less attractive as the number of processors sharing the centralized memory increases.
COMPUTER ARCHITECTURE
Thus, to achieve a speedup of 80 with 100 processors, only about 0.25% of the original computation can be sequential.
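The 0.25% figure follows from solving Amdahl's law for the sequential fraction. A minimal sketch (function name is illustrative):

```python
def max_sequential_fraction(target_speedup, n_processors):
    """Solve Amdahl's law, speedup = 1 / (f_seq + (1 - f_seq) / n),
    for the sequential fraction f_seq."""
    return (1 / target_speedup - 1 / n_processors) / (1 - 1 / n_processors)

f_seq = max_sequential_fraction(80, 100)
print(f"sequential fraction = {f_seq:.4%}")  # about 0.25%
```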
Challenges of Parallel Processing
• The second challenge is the relatively high cost of communication.
Example: Suppose a 32-core MP runs at 4 GHz, a remote memory access takes 100 ns, all local accesses hit in the memory hierarchy, and the base CPI is 0.5. What is the performance impact if 0.2% of instructions involve a remote access?
• Answer:
– Remote access cost = 100 ns / 0.25 ns per cycle = 400 clock cycles.
– CPI = Base CPI + Remote request rate × Remote request cost
– CPI = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3
– The MP with all local references (no communication) is 1.3/0.5 = 2.6 times faster than when 0.2% of instructions involve a remote access.
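The arithmetic above can be checked directly (a small sketch using the example's numbers):

```python
clock_ghz = 4.0
cycle_ns = 1 / clock_ghz            # 0.25 ns per cycle at 4 GHz
remote_cycles = 100 / cycle_ns      # 100 ns remote access = 400 cycles
base_cpi = 0.5
remote_rate = 0.002                 # 0.2% of instructions

effective_cpi = base_cpi + remote_rate * remote_cycles
slowdown = effective_cpi / base_cpi
print(effective_cpi, slowdown)      # ~1.3 and ~2.6
```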
1. Application parallelism -> addressed primarily via new algorithms that have better parallel performance
2. Long remote latency impact -> addressed both by the architect and by the programmer
Much of this lecture focuses on techniques for reducing the impact of long remote latency:
– How can caching be used to reduce remote access frequency while maintaining a coherent view of memory?
– How can synchronization be made efficient?
– What latency-hiding techniques can be used?
3. Write serialization: two writes to the same location by any two processors are seen in the same order by all processors.
– If not, a processor could keep the value 1 indefinitely because it saw it as the last write.
– For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.
A Write-Back Snoopy Protocol
• Invalidation protocol, write-back caches
– Each cache snoops every address placed on the bus
– If a cache has a dirty copy of the requested block, it provides that block in response to the read request and aborts the memory access
• Each memory block is in one of three states:
– Clean in all caches and up-to-date in memory (Shared)
– OR Dirty in exactly one cache (Exclusive)
– OR Not in any cache (Invalid)
• Each cache block is in one of three states (the cache tracks these):
– Shared: the block can be read
– OR Exclusive: this cache has the only copy; it is writable and dirty
– OR Invalid: the block contains no data (as in a uniprocessor cache)
• Read misses: cause all caches to snoop the bus
• Writes to clean blocks are treated as misses
Exercise
Four processors share memory using the UMA model. Apply the snoopy-based protocol to the following sequence of memory operations, writing the state of the cache blocks after each operation.
P1 reads X
P4 reads X
P3 writes to X
P2 reads X
P1 writes to X
P2 writes to X
Where X is a block in main memory
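The exercise can be traced with a small simulator (a Python sketch; state names follow the slides: Shared, Exclusive = the dirty owner, Invalid):

```python
def simulate(ops, n_procs=4):
    """Trace an invalidation-based write-back snoopy protocol for one block.
    Per-cache states: 'I' (Invalid), 'S' (Shared), 'E' (Exclusive/dirty)."""
    state = {p: 'I' for p in range(1, n_procs + 1)}
    trace = []
    for who, op in ops:
        if op == 'R':
            if state[who] == 'I':          # read miss: snoop the bus
                for p in state:
                    if state[p] == 'E':    # dirty copy supplies the data,
                        state[p] = 'S'     # then downgrades to Shared
                state[who] = 'S'
        else:                              # 'W'
            if state[who] != 'E':          # write miss / write to clean block
                for p in state:
                    if p != who:
                        state[p] = 'I'     # invalidate all other copies
                state[who] = 'E'
        trace.append((f"P{who} {op} X", dict(state)))
    return trace

ops = [(1, 'R'), (4, 'R'), (3, 'W'), (2, 'R'), (1, 'W'), (2, 'W')]
for step, snapshot in simulate(ops):
    print(step, snapshot)
```

After the last operation, P2 holds the block Exclusive and all other copies are Invalid.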
Extensions to the Basic Coherence Protocol
• The basic protocol is a simple three-state MSI (Modified, Shared, Invalid) protocol.
• Add an exclusive state to indicate that a cache block is resident in only a single cache but is clean. (It can then be written without generating any invalidates.)
• Add an owned state to indicate that the main memory copy is out of date and that the designated cache is the owner.
– MOESI (Modified, Owned, Exclusive, Shared, Invalid) protocol, e.g., the AMD Opteron
Example:
Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Assuming the following sequence of events, identify the true and false sharing misses in each of the five steps.
4. This event is a false sharing miss for the same reason as step 3.
5. This event is a true sharing miss, since the value being read was written by P2.
1. Changing the cache block size, or the amount of the cache block covered by a given valid bit, are hardware changes.
2. Ensure that no non-shared objects are located in the same cache block frame as any shared object. If this is done, then even with just a single valid bit per cache block, false sharing is impossible.
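Mitigation 2 is what per-thread padding achieves in software. A toy sketch (the 64-byte line size and 8-byte counter size are assumptions for illustration):

```python
LINE = 64   # assumed cache-line size in bytes
ITEM = 8    # assumed per-thread 8-byte counter

def cache_line(byte_addr, line=LINE):
    """Which cache line a byte address falls on."""
    return byte_addr // line

# Unpadded layout: adjacent counters share one line -> false sharing
unpadded = [i * ITEM for i in range(4)]
# Padded layout: each counter starts on its own line
padded = [i * LINE for i in range(4)]

print([cache_line(a) for a in unpadded])  # [0, 0, 0, 0]
print([cache_line(a) for a in padded])    # [0, 1, 2, 3]
```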
Distributed Shared-Memory and Directory-Based Coherence
Alternative idea: avoid broadcast by storing information about the status of the
line in one place: a “directory”
- The directory entry for a cache line contains information about the
state of the cache line in all caches.
Directory Protocol
• Similar to the snoopy protocol: three states
– Shared: one or more processors have the data; memory is up-to-date
– Uncached: no processor has it; not valid in any cache
– Exclusive: exactly one processor (the owner) has the data; memory is out-of-date
• In addition to the cache state, the directory must track which processors have the data when in the Shared state (usually a bit vector: 1 if the processor has a copy)
• Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:
– Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
– Data write-back: the owner processor is replacing the block and hence must write it back. This makes the memory copy up-to-date (the home directory essentially becomes the owner), the block is now Uncached, and the Sharers set is empty.
– Write miss: the block has a new owner. A message is sent to the old owner, causing the cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
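The request types above can be sketched as a directory entry holding a state and a sharers set (a minimal Python model; it ignores message latency and races):

```python
class DirectoryEntry:
    """Minimal model of one memory block's directory entry."""
    def __init__(self):
        self.state = 'Uncached'
        self.sharers = set()   # processors with a copy (a bit vector in hardware)

    def read_miss(self, requester):
        if self.state == 'Exclusive':
            # Fetch from the owner; the owner keeps a readable (Shared) copy
            (owner,) = self.sharers
            self.sharers = {owner, requester}
        else:
            self.sharers.add(requester)
        self.state = 'Shared'

    def write_miss(self, requester):
        # Old owner and all sharers are invalidated; requester becomes sole owner
        self.sharers = {requester}
        self.state = 'Exclusive'

    def write_back(self, owner):
        # The owner replaces the block; memory becomes up-to-date
        self.sharers.discard(owner)
        self.state = 'Uncached'

entry = DirectoryEntry()
entry.read_miss(1); entry.read_miss(2)   # two Shared copies
entry.write_miss(3)                      # P3 becomes the owner
print(entry.state, entry.sharers)        # Exclusive {3}
```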
Example 1: read miss to clean line
Example 2: read miss to dirty line: read from main memory by processor 0 of the blue line; the line is dirty (its contents are in P2's cache)
▪ On reads, directory tells requesting node exactly where to get the line from
- Either from home node (if the line is clean)
- Or from the owning node (if the line is dirty)
- Either way, retrieving data involves only point-to-point communication
Load linked (also called load locked) is paired with a special store called store conditional. These instructions are used in sequence: if the contents of the memory location specified by the load linked are changed before the store conditional to the same address, the store conditional fails.
Atomic exchange on the memory location specified by
the contents of R1:
try: MOV R3,R4 ;mov exchange value
LL R2,0(R1) ;load linked
SC R3,0(R1) ;store conditional
BEQZ R3,try ;branch if store fails
MOV R4,R2 ;put load value in R4
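The failure condition can be modeled directly (a toy Python model with a single link register; this illustrates the semantics, not the hardware implementation):

```python
class LLSCMemory:
    """Toy model: any store to the linked address breaks the link."""
    def __init__(self):
        self.mem = {}
        self.link = None           # address currently "linked", or None

    def load_linked(self, addr):
        self.link = addr
        return self.mem.get(addr, 0)

    def store(self, addr, value):  # ordinary store (e.g., by another processor)
        if self.link == addr:
            self.link = None       # intervening write breaks the link
        self.mem[addr] = value

    def store_conditional(self, addr, value):
        if self.link != addr:
            return False           # SC fails; software retries from the LL
        self.mem[addr] = value
        self.link = None
        return True

m = LLSCMemory()
m.load_linked(0x100)
m.store(0x100, 7)                     # another processor writes in between
print(m.store_conditional(0x100, 1))  # False: the SC fails
m.load_linked(0x100)                  # retry, as the BEQZ loop above does
print(m.store_conditional(0x100, 1))  # True: the SC succeeds
```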
Answer: When we wait for invalidates, each write takes the sum of the
ownership time plus the time to complete the invalidates. Since the
invalidates can overlap, we need only worry about the last one, which
starts 10 + 10 + 10 + 10 = 40 cycles after ownership is established.
Hence, the total time for the write is 50 + 40 + 80 = 170 cycles.
In comparison, the ownership time is only 50 cycles. With appropriate
write buffer implementations, it is even possible to continue before
ownership is established.
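The cycle counts in this answer can be rechecked as follows (the parameters are inferred from the answer itself: 50-cycle ownership, four invalidates issued 10 cycles apart, and 80 cycles for the last invalidate to complete):

```python
ownership = 50       # cycles to establish ownership
issue = 10           # cycles between successive invalidate issues
n_invalidates = 4    # inferred from 10 + 10 + 10 + 10 = 40
complete = 80        # cycles for the last invalidate to complete

last_start = n_invalidates * issue         # last invalidate starts 40 cycles in
total = ownership + last_start + complete  # total write time
print(total)                               # 170
```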
Relaxed Consistency Models
Rules:
X → Y : operation X must complete before operation Y is done
Sequential consistency requires all four orderings:
R → W, R → R, W → R, W → W
Relaxing:
1. Relax W → R : "total store ordering"
2. Relax W → W : "partial store order"
3. Relax R → W and R → R : "weak ordering" and "release consistency"
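The effect of relaxing W → R can be seen on the classic store-buffering litmus test: under sequential consistency the outcome r1 = r2 = 0 is impossible, but total store ordering allows it. A small enumerator (Python sketch; TSO is modeled simply as optionally reordering each thread's W → R pair):

```python
def interleave(a, b):
    """All interleavings of two per-thread operation sequences."""
    if not a: yield b; return
    if not b: yield a; return
    for rest in interleave(a[1:], b):
        yield [a[0]] + rest
    for rest in interleave(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    mem, regs = {'x': 0, 'y': 0}, {}
    for op, loc, arg in schedule:
        if op == 'W': mem[loc] = arg
        else: regs[arg] = mem[loc]   # 'R': arg names the destination register
    return regs['r1'], regs['r2']

# Store-buffering test.  T1: x = 1; r1 = y    T2: y = 1; r2 = x
t1 = [('W', 'x', 1), ('R', 'y', 'r1')]
t2 = [('W', 'y', 1), ('R', 'x', 'r2')]

sc = {run(s) for s in interleave(t1, t2)}

# TSO: each thread may also run with its W -> R pair reordered
orders1, orders2 = [t1, t1[::-1]], [t2, t2[::-1]]
tso = {run(s) for a in orders1 for b in orders2 for s in interleave(a, b)}

print((0, 0) in sc)    # False: forbidden under sequential consistency
print((0, 0) in tso)   # True: allowed once W -> R is relaxed
```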
THANK YOU
Rajeshwari B
Department of Electronics and Communication
Engineering
rajeshwari@pes.edu