
Multiprocessors and Thread-Level Parallelism

UNIT IV

Performance of Symmetric Shared-Memory Multiprocessors

Performance of Symmetric Shared-Memory Multiprocessors


In a bus-based multiprocessor using an invalidation protocol, overall cache performance is a combination of:
1. Uniprocessor cache miss traffic
2. Traffic caused by communication, which results in invalidations and subsequent cache misses
Changing the processor count, cache size, and block size can affect these two components of the miss rate.

Performance of Symmetric Shared-Memory Multiprocessors ( Cont )


Uniprocessor miss rates:
1. Compulsory
2. Capacity
3. Conflict
Communication miss rate: coherence misses = true sharing misses + false sharing misses

Coherence Misses
Def: The misses that arise from interprocessor communication. They can be broken down into two separate sources:
1. True sharing misses
2. False sharing misses

1. True Sharing Misses


These arise from the communication of data through the cache coherence mechanism. The first write by a processing element (PE) to a shared cache block causes an invalidation to establish ownership of that block. When another PE later attempts to read a modified word in that cache block, a miss occurs and the resultant block is transferred.

2. False Sharing Misses


These arise from the use of an invalidation-based coherence algorithm with a single valid bit per cache block. They occur when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into. The invalidation does not cause a new value to be communicated; it only causes an extra cache miss. The block is shared, but no word in the cache is actually shared.
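Because false sharing depends only on two words mapping to the same cache block, the condition can be sketched directly. The byte addresses and the 64-byte block size below are illustrative assumptions, not values from the text:

```python
def same_cache_block(addr_a, addr_b, block_size=64):
    """Two addresses fall in the same cache block iff they share a block index."""
    return addr_a // block_size == addr_b // block_size

# Hypothetical layout: x1 at byte 0x100, x2 at byte 0x104 (adjacent 4-byte words).
x1, x2 = 0x100, 0x104
print(same_cache_block(x1, x2))        # True: same 64-byte block, so false sharing is possible
print(same_cache_block(x1, x1 + 64))   # False: next block, so no false sharing between them
```

Padding a hot variable out to its own block (e.g. spacing writers 64 bytes apart) is the standard way to make the second case hold and eliminate false sharing.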

True and False Sharing Miss Example


Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss or a false sharing miss.
Time    P1          P2
1       Write x1
2                   Read x2
3       Write x1
4                   Write x2
5       Read x2


Solution Explanation
1. This event is a true sharing miss, since x1 was read by P2 and needs to be invalidated in P2.
2. This event is a false sharing miss, since x2 was invalidated by the write of x1 in P1, but that value of x1 is not used in P2.
3. This event is a false sharing miss, since the block containing x1 is marked shared due to the read in P2, but P2 did not read x1. A write miss is required to obtain exclusive access to the block.
4. This event is a false sharing miss, for the same reason as step 3.
5. This event is a true sharing miss, since the value being read was written by P2.
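The classification rule applied above can be sketched as a tiny MSI-style simulator for this single two-word block. Tracking which words each processor has touched since the last coherence action on the block is enough to label every miss; the code structure is illustrative, not from the text, and the initial "both processors read both words" assumption mirrors the block starting in the shared state:

```python
def classify(events):
    """Label each coherence miss in `events` as 'true' or 'false' sharing.

    Sketch for one two-word block shared by P1 and P2 (MSI-style states).
    accessed[p] holds the words p has touched since the last coherence
    action on the block; a miss is a true sharing miss iff the word
    involved was touched by the other processor in that window.
    """
    state = {'P1': 'S', 'P2': 'S'}  # block starts Shared in both caches
    accessed = {'P1': {'x1', 'x2'}, 'P2': {'x1', 'x2'}}  # both read both words earlier
    results = []
    for proc, op, word in events:
        other = 'P2' if proc == 'P1' else 'P1'
        miss = (op == 'write' and state[proc] != 'M') or \
               (op == 'read' and state[proc] == 'I')
        if miss:
            results.append('true' if word in accessed[other] else 'false')
            accessed[other], accessed[proc] = set(), {word}
            if op == 'write':   # invalidate the other copy, take ownership
                state[proc], state[other] = 'M', 'I'
            else:               # fetch the block; both copies become Shared
                state[proc], state[other] = 'S', 'S'
        else:
            accessed[proc].add(word)
    return results

events = [('P1', 'write', 'x1'), ('P2', 'read', 'x2'), ('P1', 'write', 'x1'),
          ('P2', 'write', 'x2'), ('P1', 'read', 'x2')]
print(classify(events))   # ['true', 'false', 'false', 'false', 'true']
```

The output matches the five labels in the solution above.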

Performance Measurements
The following are the different performance measurements of symmetric shared-memory multiprocessors:
1. Commercial workload
2. Multiprogramming and OS workload
3. Scientific/technical workload


The following models are used for the performance measurements of the commercial workload: the AlphaServer 4100 and a configurable simulator model.

1. Performance Measurements Commercial workload


Performance Measurements Commercial workload - AlphaServer 4100


The AlphaServer 4100 has four processors. Each processor has a three-level cache hierarchy:
1. L1 consists of a pair of 8 KB direct-mapped on-chip caches, one for instructions and one for data.
2. L2 is a 96 KB on-chip unified three-way set associative cache with a 32-byte block size, using write-back.
3. L3 is an off-chip, combined, direct-mapped 2 MB cache with 64-byte blocks, also using write-back.

Performance Measurements Commercial workload - AlphaServer 4100 ( Cont )


The latency for an access:
o L2 = 7 cycles
o L3 = 21 cycles
o Main memory = 80 clock cycles
Execution time breaks down into:
o Instruction execution time
o Cache access time
o Memory access time
o Other stalls
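Given the latencies above, the average memory access time follows by weighting each level's latency by how often an access reaches it. The latencies (L2 = 7, L3 = 21, memory = 80 cycles) come from the text; the hit and local miss rates in this sketch are illustrative placeholders, not measured AlphaServer figures:

```python
def amat(l1_hit, l1_miss, l2_local_miss, l3_local_miss,
         l2_lat=7, l3_lat=21, mem_lat=80):
    """Expected cycles per access: each level's latency times the fraction of accesses reaching it."""
    return (l1_hit
            + l1_miss * (l2_lat
                         + l2_local_miss * (l3_lat
                                            + l3_local_miss * mem_lat)))

# Hypothetical rates: 5% of accesses miss L1, 40% of those miss L2, 30% of those miss L3.
print(amat(l1_hit=1.0, l1_miss=0.05, l2_local_miss=0.40, l3_local_miss=0.30))  # 2.25 cycles
```

The calculation makes the OLTP result below easy to see: with a high L3 local miss rate, the 80-cycle memory term dominates.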

Performance Measurements Commercial workload - AlphaServer 4100 ( Cont )


Performance of the DSS (Decision Support System) and AltaVista workloads is reasonable, while performance of the OLTP (Online Transaction Processing) workload is poor. The impact on the OLTP benchmark of L3 cache size, processor count, and block size is the focus, because the OLTP workload places the greatest demand on the memory system, with large numbers of L3 misses.


Execution Time of Commercial Workload


Poor performance on OLTP is due to L3 misses, i.e., poor performance of the memory hierarchy.

Two-way set associative caches


The following diagram shows the effect of increasing the cache size, using two-way set associative caches, which reduces the large number of conflict misses.
Execution time improves as the L3 cache grows, due to the reduction in L3 misses. Idle time also grows as cache size increases, reducing some of the performance gains.


Relative Performance of the OLTP w.r.t. the Size of L3

Contributing causes of memory access cycles


The following diagram displays the number of memory access cycles contributed per instruction from five sources:
1. True sharing
2. False sharing
3. Instruction
4. Capacity/conflict
5. Cold (compulsory)


Contributing Causes of Memory Access Cycles with Increase Size of L3

Memory Access Cycles versus Processor Count

L3 Miss Rate versus Block Size

Explanations
o True sharing and false sharing misses are unchanged going from a 1 MB to an 8 MB L3 cache.
o Uniprocessor cache misses (instruction, capacity/conflict, compulsory) improve as cache size increases.
o The L3 cache is simulated as two-way set associative.
o The cold, false sharing, and true sharing contributions are unaffected by L3 cache size.
o The contribution to memory access cycles increases as the processor count increases, primarily due to increased true sharing.
o The increase in the true sharing miss rate leads to an overall increase in memory access cycles per instruction.
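The decomposition above amounts to a small calculation: each source contributes its misses per instruction multiplied by the miss penalty. The per-source rates and the 80-cycle penalty below are illustrative assumptions, not the measured workload values:

```python
def mem_cycles_per_instr(misses_per_instr, miss_penalty):
    """Per-source memory access cycles per instruction: misses/instruction times penalty."""
    return {src: m * miss_penalty for src, m in misses_per_instr.items()}

# Hypothetical misses per instruction for the five sources named in the text.
sources = {'instruction': 0.0010, 'capacity/conflict': 0.0015,
           'cold': 0.0002, 'false sharing': 0.0003, 'true sharing': 0.0020}
contrib = mem_cycles_per_instr(sources, miss_penalty=80)
print(round(sum(contrib.values()), 3))   # total memory access cycles per instruction
```

With numbers of this shape, doubling only the true sharing rate (as a larger processor count tends to do) visibly raises the total, which is exactly the trend described above.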


2. Performance Measurements of the Multiprogramming and OS workload


Two independent copies of the compile phase of the Andrew benchmark:
o A parallel make using eight processors
o Runs for 5.24 seconds on eight processors, creating 203 processes and performing 787 disk requests on three different file systems
o Runs with 128 MB of memory and no paging activity


Performance Measurements of the Multiprogramming and OS workload(Cont)


Three distinct phases:
o Compile: substantial compute activity
o Install the object files in a library: dominated by I/O
o Remove the object files: dominated by I/O; only two PEs are active


Performance Measurements of the Multiprogramming and OS workload(Cont)


Measure CPU idle time and I-cache performance:
o L1 I-cache: 32 KB, two-way set associative with 64-byte blocks, 1 clock cycle hit time
o L1 D-cache: 32 KB, two-way set associative with 32-byte blocks, 1 clock cycle hit time
o L2 cache: 1 MB unified, two-way set associative with 128-byte blocks, 10 clock cycle hit time
o Main memory: single memory on a bus with an access time of 100 clock cycles
o Disk system: fixed-access latency of 3 ms (less than normal, to reduce idle time)

Distribution of Execution Time in the Multiprogrammed Parallel Make Workload


                            % instructions executed    % execution time
User execution                        27                       27
Kernel execution                       3                        7
Synchronization wait                   1                        2
CPU idle (waiting for I/O)            69                       64

There is a significant I-cache performance loss, at least for the OS. The I-cache miss rate in the OS for a 64-byte block size, two-way set associative cache drops from 1.7% (32 KB) to 0.2% (256 KB). The user-level I-cache miss rate is about one-sixth of the OS rate.

Data Miss Rate vs Data Cache Size


The misses can be broken into three significant classes:
1. Compulsory misses represent the first access to a block by a processor and are significant in this workload.
2. Coherence misses represent misses due to invalidations.
3. Normal capacity misses include misses caused by interference between the OS and user processes, and between multiple user processes.
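The first and third classes can be distinguished mechanically: a miss to a never-before-seen block is compulsory; a miss to a previously resident block is a capacity miss. This sketch uses a fully associative LRU cache as the model (coherence misses require a multiprocessor model and are omitted; the trace and two-block capacity are illustrative):

```python
from collections import OrderedDict

def classify_misses(trace, capacity_blocks):
    """Label each reference in a block-address trace as hit, compulsory, or capacity."""
    seen, cache = set(), OrderedDict()   # cache in LRU order: oldest entry first
    kinds = []
    for block in trace:
        if block in cache:
            cache.move_to_end(block)     # refresh LRU position
            kinds.append('hit')
            continue
        kinds.append('compulsory' if block not in seen else 'capacity')
        seen.add(block)
        cache[block] = True
        if len(cache) > capacity_blocks:
            cache.popitem(last=False)    # evict the least recently used block
    return kinds

print(classify_misses(['A', 'B', 'C', 'A', 'D', 'B'], capacity_blocks=2))
# ['compulsory', 'compulsory', 'compulsory', 'capacity', 'compulsory', 'capacity']
```

Growing `capacity_blocks` removes only the capacity misses, which matches the behavior of the kernel and user miss-rate curves discussed below.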

As the data cache size increases, the user miss rate drops by a factor of 3, while the kernel miss rate drops by only a factor of 1.3.



Reasons
The behavior of the OS is more complex than that of user processes for two reasons:
1. The kernel initializes all pages before allocating them to a user.
2. The kernel shares data and thus has a nontrivial coherence miss rate.


Components of Kernel Miss Rate


There is a high rate of compulsory and coherence misses.


Components of Kernel Miss Rate ( Cont )


As the cache size increases:
o The compulsory miss rate stays constant.
o The capacity miss rate (including conflict misses) drops by more than a factor of 2.
o The coherence miss rate nearly doubles: the probability of a miss being caused by an invalidation increases with cache size.


Kernel and User Behavior


Kernel behavior:
o Initializes all pages before allocating them to users, causing compulsory misses
o Actually shares data, causing coherence misses

User process behavior:
o Causes coherence misses only when the process is scheduled on a different processor, resulting in a small miss rate


Miss Rate VS. Block Size


With a 32 KB two-way set associative data cache, as block size increases the user miss rate drops by a factor of just under 3, while the kernel miss rate drops by a factor of 4.


Miss Rate VS. Block Size for Kernel

As block size increases, compulsory misses drop significantly, while coherence misses stay roughly constant.


Miss rate vs. block size


Compulsory and capacity misses can be reduced with larger block sizes. The largest improvement is the reduction in the compulsory miss rate. The absence of large increases in the coherence miss rate as block size is increased means that false sharing effects are insignificant.
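The reason larger blocks cut compulsory misses can be sketched with a simple model: a first-time sequential scan of N bytes takes N / block_size cold misses, so doubling the block size halves them. The 1 MB scan below is an illustrative figure, not a measurement from the workload:

```python
def cold_misses(bytes_scanned, block_size):
    """Cold misses for a first-time sequential scan: one miss per block touched."""
    return bytes_scanned // block_size

# Cold misses for a hypothetical 1 MB scan at the block sizes in the figures.
for bs in (32, 64, 128, 256):
    print(bs, cold_misses(1 << 20, bs))
```

Capacity misses shrink for the same spatial-locality reason, while coherence misses do not: a block's invalidation is triggered by a single written word regardless of block size, which is why the flat coherence curve indicates little false sharing.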
