
Multiprocessors and Thread-Level Parallelism

UNIT IV

Performance of Symmetric Shared-Memory Multiprocessors

Performance of Symmetric Shared-Memory Multiprocessors


In a bus-based multiprocessor using an invalidation protocol, overall cache performance is a combination of:
1. Uniprocessor cache miss traffic
2. Traffic caused by communication, which results in invalidations and subsequent cache misses
Changing the processor count, cache size, and block size can affect these two components of the miss rate.

Performance of Symmetric Shared-Memory Multiprocessors ( Cont )


Uniprocessor miss rates:
1. Compulsory
2. Capacity
3. Conflict
Communication miss rate: coherence misses = true sharing misses + false sharing misses

Coherence Misses
Def: The misses that arise from interprocessor communication. They can be broken down into two separate sources:
1. True sharing misses
2. False sharing misses

1. True Sharing Misses


These arise from the communication of data through the cache coherence mechanism. The first write by a processing element (PE) to a shared cache block causes an invalidation to establish ownership of that block. When another PE later attempts to read a modified word in that cache block, a miss occurs and the resultant block is transferred.

2. False Sharing Misses


These arise from the use of an invalidation-based coherence algorithm with a single valid bit per cache block. They occur when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into. The invalidation does not cause a new value to be communicated; it only causes an extra cache miss. The block is shared, but no word in the cache is actually shared.
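Because false sharing depends only on two words mapping to the same cache block, the condition can be sketched directly. The byte addresses and the 64-byte block size below are illustrative assumptions, not values from the text:

```python
def same_cache_block(addr_a, addr_b, block_size=64):
    """Two addresses fall in the same cache block iff they share a block index."""
    return addr_a // block_size == addr_b // block_size

# Hypothetical layout: x1 at byte 0x100, x2 at byte 0x104 (adjacent 4-byte words).
x1, x2 = 0x100, 0x104
print(same_cache_block(x1, x2))        # True: same 64-byte block, so false sharing is possible
print(same_cache_block(x1, x1 + 64))   # False: next block, so no false sharing between them
```

Padding a hot variable out to its own block (e.g. spacing writers 64 bytes apart) is the standard way to make the second case hold and eliminate false sharing.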

True and False Sharing Miss Example


Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss or a false sharing miss.
Time    P1          P2
1       Write x1
2                   Read x2
3       Write x1
4                   Write x2
5       Read x2


Solution Explanation
1. This event is a true sharing miss, since x1 was read by P2 and needs to be invalidated in P2.
2. This event is a false sharing miss, since x2 was invalidated by the write of x1 in P1, but that value of x1 is not used in P2.
3. This event is a false sharing miss, since the block containing x1 is marked shared due to the read in P2, but P2 did not read x1. A write miss is required to obtain exclusive access to the block.
4. This event is a false sharing miss, for the same reason as step 3.
5. This event is a true sharing miss, since the value being read was written by P2.
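The classification rule applied above can be sketched as a tiny MSI-style simulator for this single two-word block. Tracking which words each processor has touched since the last coherence action on the block is enough to label every miss; the code structure is illustrative, not from the text, and the initial "both processors read both words" assumption mirrors the block starting in the shared state:

```python
def classify(events):
    """Label each coherence miss in `events` as 'true' or 'false' sharing.

    Sketch for one two-word block shared by P1 and P2 (MSI-style states).
    accessed[p] holds the words p has touched since the last coherence
    action on the block; a miss is a true sharing miss iff the word
    involved was touched by the other processor in that window.
    """
    state = {'P1': 'S', 'P2': 'S'}  # block starts Shared in both caches
    accessed = {'P1': {'x1', 'x2'}, 'P2': {'x1', 'x2'}}  # both read both words earlier
    results = []
    for proc, op, word in events:
        other = 'P2' if proc == 'P1' else 'P1'
        miss = (op == 'write' and state[proc] != 'M') or \
               (op == 'read' and state[proc] == 'I')
        if miss:
            results.append('true' if word in accessed[other] else 'false')
            accessed[other], accessed[proc] = set(), {word}
            if op == 'write':   # invalidate the other copy, take ownership
                state[proc], state[other] = 'M', 'I'
            else:               # fetch the block; both copies become Shared
                state[proc], state[other] = 'S', 'S'
        else:
            accessed[proc].add(word)
    return results

events = [('P1', 'write', 'x1'), ('P2', 'read', 'x2'), ('P1', 'write', 'x1'),
          ('P2', 'write', 'x2'), ('P1', 'read', 'x2')]
print(classify(events))   # ['true', 'false', 'false', 'false', 'true']
```

The output matches the five labels in the solution above.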

Performance Measurements
The following are the different performance measurements of symmetric shared-memory multiprocessors:
1. Commercial workload
2. Multiprogramming and OS workload
3. Scientific/technical workload


The following models are used for the performance measurements of the commercial workload: the AlphaServer 4100 and a configurable simulator model.

1. Performance Measurements Commercial workload


Performance Measurements Commercial workload - AlphaServer 4100


The AlphaServer 4100 has four processors. Each processor has a three-level cache hierarchy:
1. L1 consists of a pair of 8 KB direct-mapped on-chip caches, one for instructions and one for data.
2. L2 is a 96 KB on-chip unified three-way set associative cache with a 32-byte block size, using write-back.
3. L3 is an off-chip, combined, direct-mapped 2 MB cache with 64-byte blocks, also using write-back.

Performance Measurements Commercial workload - AlphaServer 4100 ( Cont )


The latency for an access:
o L2 = 7 cycles
o L3 = 21 cycles
o Main memory = 80 clock cycles
Execution time breaks down into:
o Instruction execution time
o Cache access time
o Memory access time
o Other stalls
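Given the latencies above, the average memory access time follows by weighting each level's latency by how often an access reaches it. The latencies (L2 = 7, L3 = 21, memory = 80 cycles) come from the text; the hit and local miss rates in this sketch are illustrative placeholders, not measured AlphaServer figures:

```python
def amat(l1_hit, l1_miss, l2_local_miss, l3_local_miss,
         l2_lat=7, l3_lat=21, mem_lat=80):
    """Expected cycles per access: each level's latency times the fraction of accesses reaching it."""
    return (l1_hit
            + l1_miss * (l2_lat
                         + l2_local_miss * (l3_lat
                                            + l3_local_miss * mem_lat)))

# Hypothetical rates: 5% of accesses miss L1, 40% of those miss L2, 30% of those miss L3.
print(amat(l1_hit=1.0, l1_miss=0.05, l2_local_miss=0.40, l3_local_miss=0.30))  # 2.25 cycles
```

The calculation makes the OLTP result below easy to see: with a high L3 local miss rate, the 80-cycle memory term dominates.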

Performance Measurements Commercial workload - AlphaServer 4100 ( Cont )


Performance of the DSS (Decision Support System) and AltaVista workloads is reasonable, while performance of the OLTP (Online Transaction Processing) workload is poor. The impact on the OLTP benchmark of L3 cache size, processor count, and block size is the focus, because the OLTP workload places the greatest demand on the memory system, with large numbers of L3 misses.


Execution Time of Commercial Workload


Poor performance on OLTP is due to L3 misses, i.e., poor performance of the memory hierarchy.

Two-way set associative caches


The following diagram shows the effect of increasing the cache size, using two-way set associative caches, which reduces the large number of conflict misses.
Execution time improves as the L3 cache grows, due to the reduction in L3 misses. Idle time also grows as cache size increases, reducing some of the performance gains.


Relative Performance of the OLTP w.r.t. the Size of L3

Contributing causes of memory access cycles


The following diagram displays the number of memory access cycles contributed per instruction from five sources:
1. True sharing
2. False sharing
3. Instruction
4. Capacity/conflict
5. Cold (compulsory)


Contributing Causes of Memory Access Cycles with Increase Size of L3

Memory Access Cycles versus Processor Count

L3 Miss Rate versus Block Size

Explanations
o True sharing and false sharing misses are unchanged going from a 1 MB to an 8 MB L3 cache.
o Uniprocessor cache misses (instruction, capacity/conflict, compulsory) improve as cache size increases.
o The L3 cache is simulated as two-way set associative.
o The cold, false sharing, and true sharing contributions are unaffected by L3 cache size.
o The contribution to memory access cycles increases as the processor count increases, primarily due to increased true sharing.
o The increase in the true sharing miss rate leads to an overall increase in memory access cycles per instruction.
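The decomposition above amounts to a small calculation: each source contributes its misses per instruction multiplied by the miss penalty. The per-source rates and the 80-cycle penalty below are illustrative assumptions, not the measured workload values:

```python
def mem_cycles_per_instr(misses_per_instr, miss_penalty):
    """Per-source memory access cycles per instruction: misses/instruction times penalty."""
    return {src: m * miss_penalty for src, m in misses_per_instr.items()}

# Hypothetical misses per instruction for the five sources named in the text.
sources = {'instruction': 0.0010, 'capacity/conflict': 0.0015,
           'cold': 0.0002, 'false sharing': 0.0003, 'true sharing': 0.0020}
contrib = mem_cycles_per_instr(sources, miss_penalty=80)
print(round(sum(contrib.values()), 3))   # total memory access cycles per instruction
```

With numbers of this shape, doubling only the true sharing rate (as a larger processor count tends to do) visibly raises the total, which is exactly the trend described above.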


2. Performance Measurements of the Multiprogramming and OS workload


Two independent copies of the compile phase of the Andrew benchmark:
o A parallel make using eight processors
o Runs for 5.24 seconds on eight processors, creating 203 processes and performing 787 disk requests on three different file systems
o Runs with 128 MB of memory and no paging activity


Performance Measurements of the Multiprogramming and OS workload(Cont)


Three distinct phases:
o Compile: substantial compute activity
o Install the object files in a library: dominated by I/O
o Remove the object files: dominated by I/O; only two PEs are active


Performance Measurements of the Multiprogramming and OS workload(Cont)


Measure CPU idle time and I-cache performance:
o L1 I-cache: 32 KB, two-way set associative with 64-byte blocks, 1 clock cycle hit time
o L1 D-cache: 32 KB, two-way set associative with 32-byte blocks, 1 clock cycle hit time
o L2 cache: 1 MB unified, two-way set associative with 128-byte blocks, 10 clock cycle hit time
o Main memory: single memory on a bus with an access time of 100 clock cycles
o Disk system: fixed-access latency of 3 ms (less than normal, to reduce idle time)

Distribution of Execution Time in the Multiprogrammed Parallel Make Workload


                            % instructions executed    % execution time
User execution                        27                       27
Kernel execution                       3                        7
Synchronization wait                   1                        2
CPU idle (waiting for I/O)            69                       64

There is a significant I-cache performance loss, at least for the OS. The I-cache miss rate in the OS for a 64-byte block size, two-way set associative cache drops from 1.7% (32 KB) to 0.2% (256 KB). The user-level I-cache miss rate is about one-sixth of the OS rate.

Data Miss Rate vs Data Cache Size


The misses can be broken into three significant classes:
1. Compulsory misses represent the first access to a block by a processor and are significant in this workload.
2. Coherence misses represent misses due to invalidations.
3. Normal capacity misses include misses caused by interference between the OS and user processes, and between multiple user processes.
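The first and third classes can be distinguished mechanically: a miss to a never-before-seen block is compulsory; a miss to a previously resident block is a capacity miss. This sketch uses a fully associative LRU cache as the model (coherence misses require a multiprocessor model and are omitted; the trace and two-block capacity are illustrative):

```python
from collections import OrderedDict

def classify_misses(trace, capacity_blocks):
    """Label each reference in a block-address trace as hit, compulsory, or capacity."""
    seen, cache = set(), OrderedDict()   # cache in LRU order: oldest entry first
    kinds = []
    for block in trace:
        if block in cache:
            cache.move_to_end(block)     # refresh LRU position
            kinds.append('hit')
            continue
        kinds.append('compulsory' if block not in seen else 'capacity')
        seen.add(block)
        cache[block] = True
        if len(cache) > capacity_blocks:
            cache.popitem(last=False)    # evict the least recently used block
    return kinds

print(classify_misses(['A', 'B', 'C', 'A', 'D', 'B'], capacity_blocks=2))
# ['compulsory', 'compulsory', 'compulsory', 'capacity', 'compulsory', 'capacity']
```

Growing `capacity_blocks` removes only the capacity misses, which matches the behavior of the kernel and user miss-rate curves discussed below.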

As the data cache size increases, the user miss rate drops by a factor of 3, while the kernel miss rate drops by only a factor of 1.3.



Reasons
The behavior of the OS is more complex than that of user processes for two reasons:
1. The kernel initializes all pages before allocating them to a user.
2. The kernel shares data and thus has a nontrivial coherence miss rate.


Components of Kernel Miss Rate


There is a high rate of compulsory and coherence misses.


Components of Kernel Miss Rate ( Cont )


As the cache size increases:
o The compulsory miss rate stays constant.
o The capacity miss rate (including conflict misses) drops by more than a factor of 2.
o The coherence miss rate nearly doubles: the probability of a miss being caused by an invalidation increases with cache size.


Kernel and User Behavior


Kernel behavior:
o Initializes all pages before allocating them to users, causing compulsory misses
o Actually shares data, causing coherence misses

User process behavior:
o Causes coherence misses only when the process is scheduled on a different processor, resulting in a small miss rate


Miss Rate VS. Block Size


With a 32 KB two-way set associative data cache, as block size increases the user miss rate drops by a factor of just under 3, while the kernel miss rate drops by a factor of 4.


Miss Rate VS. Block Size for Kernel

As block size increases, compulsory misses drop significantly, while coherence misses stay roughly constant.


Miss rate vs. block size


Compulsory and capacity misses can be reduced with larger block sizes. The largest improvement is the reduction in the compulsory miss rate. The absence of large increases in the coherence miss rate as block size is increased means that false sharing effects are insignificant.
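The reason larger blocks cut compulsory misses can be sketched with a simple model: a first-time sequential scan of N bytes takes N / block_size cold misses, so doubling the block size halves them. The 1 MB scan below is an illustrative figure, not a measurement from the workload:

```python
def cold_misses(bytes_scanned, block_size):
    """Cold misses for a first-time sequential scan: one miss per block touched."""
    return bytes_scanned // block_size

# Cold misses for a hypothetical 1 MB scan at the block sizes in the figures.
for bs in (32, 64, 128, 256):
    print(bs, cold_misses(1 << 20, bs))
```

Capacity misses shrink for the same spatial-locality reason, while coherence misses do not: a block's invalidation is triggered by a single written word regardless of block size, which is why the flat coherence curve indicates little false sharing.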
