
Lecture-8

Memory
Part-4
CSE-2823
Computer Architecture
Dr. Md. Waliur Rahman Miah
Associate Professor, CSE, DUET
Today’s Topic

Memory
Part-4

Ref:
Hennessy-Patterson 5e-Ch-5; 4e-Ch-7
Stallings 8e-Ch-4-5-6



Direct Mapped Cache (review)

• Addressing scheme in a direct-mapped cache:

– cache block index = memory block address mod number of cache blocks (each memory block maps to a unique cache location)
– if the cache has 2^m blocks, the cache index is the lower m bits of the n-bit memory block address
– the remaining upper n-m bits are kept as tag bits at each cache block
– also need a valid bit to recognize whether an entry holds valid data (a small code sketch follows below)
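
The bit-field arithmetic above can be written out directly; here is a minimal Python sketch (the function name and sample address are mine, not from the text):

```python
# Decompose a 32-bit byte address for a direct-mapped cache with
# 2**m one-word (4-byte) blocks, following the scheme above.

def split_address(addr, m, offset_bits=2):
    """Return (tag, index, byte_offset) for a 2**m-block cache."""
    byte_offset = addr & ((1 << offset_bits) - 1)   # which byte in the word
    index = (addr >> offset_bits) & ((1 << m) - 1)  # lower m bits of block address
    tag = addr >> (offset_bits + m)                 # remaining upper n-m bits
    return tag, index, byte_offset

# Example: 1024-block cache (m = 10), as in the figure on the next slide.
tag, index, offset = split_address(0x12345678, m=10)
print(hex(tag), index, offset)  # 20-bit tag, index in 0..1023, byte 0..3
```

Taking the lower m bits of the block address is exactly the "block address mod number of blocks" rule, since the block count is a power of 2.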



Direct Mapped Cache (review)
[Figure: a 1024-entry direct-mapped cache with one-word (4-byte) blocks, showing address bit positions. The 32-bit address splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0); the two least significant bits select 1 of 4 bytes within a word and are not associated with the cache lookup. Each of the 1024 entries (index 0..1023) stores a valid bit, a 20-bit tag, and 32 bits of data; the stored tag is compared with the upper 20 address bits to generate the hit signal.]

Cache with 1024 one-word blocks: the byte offset (the 2 least significant bits) is ignored and the next 10 bits are used to index into the cache.

If the stored tag and the upper 20 bits of the address are equal and the valid bit is on, the request hits in the cache and the word is supplied to the processor. Otherwise, a miss occurs.
Ref: Pg-392-393
Cache Read Hit/Miss
• Steps to be taken on a cache hit/miss:
• Cache read hit: no action needed
• Instruction cache read miss:
1. Send the original PC value (current PC – 4, as the PC has already been incremented in the first step of the instruction cycle) to memory [PC = Program Counter]
2. Instruct main memory to perform a read and wait for the memory to complete the access – stall on read
3. After the read completes, write the cache entry
4. Restart instruction execution at the first step to refetch the instruction
• Data cache read miss:
– Similar to an instruction cache miss (the miss handling is sketched below)
– To reduce the data miss penalty, allow the processor to keep executing instructions while waiting for the read to complete, stalling only when the missing word is actually required – stall on use (why won't this work for instruction misses?)
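
To make the hit/miss decision concrete, here is a hedged Python sketch of a read in a direct-mapped cache with 1024 one-word blocks (the data structures and names are assumptions for illustration, not from the text):

```python
# Minimal read-path model: 1024 one-word blocks, 20-bit tag, 10-bit index.
NUM_BLOCKS = 1024

cache = [{"valid": False, "tag": 0, "data": 0} for _ in range(NUM_BLOCKS)]

def read(addr, memory):
    index = (addr >> 2) % NUM_BLOCKS  # 10 index bits after the byte offset
    tag = addr >> 12                  # remaining upper 20 bits
    entry = cache[index]
    if entry["valid"] and entry["tag"] == tag:
        return entry["data"]          # read hit: no action needed
    # read miss: stall while main memory supplies the word,
    # write the cache entry, then restart (here, simply redo) the access
    entry["valid"], entry["tag"] = True, tag
    entry["data"] = memory[addr & ~0x3]  # fetch the aligned word
    return entry["data"]
```

As for the question above: an instruction cache cannot use stall-on-use, because the processor has nothing to execute until the missing instruction itself arrives.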
Ref: Pg-393-394
Cache Write Hit/Miss
• Write-through scheme
– on write hit: write the data to both the cache and memory on every write hit, so the two never become inconsistent
– on write miss: write the word into both cache and memory – with one-word blocks there is no need to first fetch the missed word from memory (contrast this with multiword blocks, later)
– write-through is slow because every write requires a memory write
• performance is improved with a write buffer, where words are stored while waiting to be written to memory – the processor can continue execution until the write buffer is full
• when a word in the write buffer completes writing into main memory, that buffer slot is freed and becomes available for future writes

• Write-back scheme
– write the data block only into the cache, and write the block back to main memory only when it is replaced in the cache
– more efficient than write-through, but more complex to implement (both policies are sketched below)
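
As a rough illustration of the two policies (assumed code, continuing the dictionary-entry style of the earlier sketch):

```python
# Write-through: update the cache AND memory on every write hit;
# a write buffer would hide the memory latency until it fills up.
def write_through(entry, addr, value, memory):
    entry["data"] = value
    memory[addr] = value

# Write-back: update only the cache and mark the block dirty.
def write_back(entry, value):
    entry["data"] = value
    entry["dirty"] = True

# On replacement, a dirty block must first be flushed to memory.
def evict(entry, block_addr, memory):
    if entry.get("dirty"):
        memory[block_addr] = entry["data"]
        entry["dirty"] = False
```

The dirty bit is what makes write-back cheaper on repeated writes to the same block, at the cost of extra bookkeeping at replacement time.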
Direct Mapped Cache: with larger blocks
[Figure: a 64 KiB direct-mapped cache with 4K entries of 4-word (128-bit) blocks, showing address bit positions. The 32-bit address splits into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0). Each entry holds a valid bit, a 16-bit tag, and 128 bits of data; a 4-to-1 multiplexor driven by the block offset selects the requested 32-bit word, and a tag comparison produces the hit signal.]

Cache with 4K 4-word blocks: the byte offset (the 2 least significant bits) is ignored, the next 2 bits are the block offset, and the next 12 bits are used to index into the cache.



Cache Size
16 KiB

FIGURE 5.12 The 16 KiB caches in the Intrinsity FastMATH each contain 256 blocks with 16 words per block. The tag field is 18 bits wide and the index field is 8 bits wide, while a 4-bit field (bits 5-2) is used to index the block and select the word from the block using a 16-to-1 multiplexor. In practice, to eliminate the multiplexor, caches use a separate large RAM for the data and a smaller RAM for the tags, with the block offset supplying the extra address bits for the large data RAM. In this case, the large RAM is 32 bits wide and must have 16 times as many words as blocks in the cache.
Direct Mapped Cache: with larger blocks

• Cache replacement in large (multiword) blocks:


– word read miss: read entire block from main memory
– word write miss: cannot simply write word and tag!
– writing in a write-through cache:
• if write hit, i.e., tag of requested address and cache entry are equal,
continue as for 1-word blocks by replacing word and writing block to both
cache and memory
• if write miss, i.e., tags are unequal, fetch the block from memory, replace the word that caused the miss, and write the block to both cache and memory
• therefore, unlike the case of 1-word blocks, a write miss with a multiword block causes a memory read (sketched below)
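
A hedged sketch of that last point, for a write-through cache with 4-word blocks (names and layout are mine; memory is modeled as a dict keyed by byte address):

```python
WORDS_PER_BLOCK = 4

def write_miss_multiword(entry, tag, block_addr, word_offset, value, memory):
    # 1. the write miss first triggers a READ of the whole block
    block = [memory[block_addr + 4 * i] for i in range(WORDS_PER_BLOCK)]
    # 2. replace the word that caused the miss
    block[word_offset] = value
    # 3. install the block in the cache and write the word through to memory
    entry.update(valid=True, tag=tag, data=block)
    memory[block_addr + 4 * word_offset] = value
```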



Direct Mapped Cache: Taking Advantage of
Spatial Locality with larger blocks

[Figure: FIGURE 5.11 Miss rate versus block size, plotted for various cache sizes.]


Direct Mapped Cache: Taking Advantage of
Spatial Locality with larger blocks contd.

• Miss rate falls at first with increasing block size, as expected, but as the block size becomes a large fraction of the total cache size, the miss rate may go up because
– there are few blocks in the cache
– competition for blocks increases
– blocks get ejected before most of their words are accessed (thrashing in the cache)

Spatial locality among the words in a block decreases with a very large block; consequently, the benefits in the miss rate become smaller.

Read the whole paragraph associated with Fig. 5.11, pg-391-392.



Ref: Pg-398; Sec-5.4
Cache Performance
• Simplified model assuming equal read and write miss
penalties:
– CPU time = (execution cycles + memory stall cycles) × cycle time
– memory stall cycles = memory accesses × miss rate × miss penalty
[memory accesses are counted for reads and/or writes; both formulas are sketched in code below]
• Therefore, two ways to improve performance in cache:
– decrease miss rate
– decrease miss penalty

– what happens if we increase block size?


• Cache performance improves at first with larger blocks, but decreases afterwards due to the increase in miss penalty (details: Patterson, page 392).
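
The two formulas above can be written as plain functions; a minimal sketch (variable names are mine; rates are fractions, penalties in clock cycles):

```python
def memory_stall_cycles(memory_accesses, miss_rate, miss_penalty):
    # stall cycles contributed by cache misses
    return memory_accesses * miss_rate * miss_penalty

def cpu_time(execution_cycles, stall_cycles, cycle_time):
    # total CPU time including memory stalls
    return (execution_cycles + stall_cycles) * cycle_time
```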



Example Problems
• Assume for a given machine and program:
– instruction cache miss rate 2%
– data cache miss rate 4%
– miss penalty always 40 cycles
– CPI of 2 without memory stalls
– frequency of load/stores 36% of instructions

1. How much faster is a machine with a perfect cache that never misses?
2. What happens if we speed up the machine by reducing its CPI to 1
without changing the clock rate?
3. What happens if we speed up the machine by doubling its clock rate, but the absolute time for a miss penalty remains the same?

[CPI = clock cycles per instruction ]


Solution
1. How much faster is a machine with a perfect cache that never misses? Find the amount of execution time spent on memory stalls.

[Given: instruction cache miss rate 2%; data cache miss rate 4%; miss penalty always 40 cycles; CPI of 2 without memory stalls; frequency of loads/stores 36% of instructions.]

Assume instruction count = I (the letter I, not the numeral one).
Instruction miss cycles = I × 2% × 40 = 0.8 × I
Data miss cycles = I × 36% × 4% × 40 = 0.576 × I
So, total memory-stall cycles = 0.8 × I + 0.576 × I = 1.376 × I,
in other words, 1.376 stall cycles per instruction.
Therefore, CPI with memory stalls = 2 + 1.376 = 3.376
Assuming the instruction count and clock rate remain the same for a perfect cache and a cache that misses:
CPU time with stalls / CPU time with perfect cache
= (I × CPI_stall × clock cycle) / (I × CPI_perfect × clock cycle)
= 3.376 / 2
= 1.688
Performance with a perfect cache is better by a factor of 1.688 [for this ratio, lower is better].
The fraction of execution time spent on memory stalls:
= memory-stall cycles per instruction / CPI with memory stalls
= 1.376 / 3.376 = 40.75%
Solution (cont.)
2. What happens if we speed up the machine by reducing its CPI to 1 without changing the clock rate?

[Given: instruction cache miss rate 2%; data cache miss rate 4%; miss penalty always 40 cycles; (changed) CPI of 1 without memory stalls; frequency of loads/stores 36% of instructions.]

• CPI without stalls = 1
• CPI with stalls = 1 + 1.376 = 2.376 (the clock has not changed, so the stall cycles per instruction, 1.376, remain the same)
• CPU time with stalls / CPU time with perfect cache
= CPI with stalls / CPI without stalls
= 2.376 / 1
= 2.376
• Performance with a perfect cache is better by a factor of 2.376
• The fraction of execution time spent on memory stalls
= 1.376 / 2.376 = 57.91% (increased from 40.75%)

• Conclusion: with a lower CPI (a faster machine), cache misses "hurt more" than with a higher CPI
Solution (cont.)
3. What happens if we speed up the machine by doubling its clock rate, but the absolute time for a miss penalty remains the same?

[Given: instruction cache miss rate 2%; data cache miss rate 4%; miss penalty always 40 cycles; CPI of 2 without memory stalls; frequency of loads/stores 36% of instructions.]

• With the doubled clock rate, miss penalty = 2 × 40 = 80 clock cycles
• Stall cycles = (I × 2% × 80) + (I × 36% × 4% × 80) = 2.752 × I,
i.e., 2.752 stall cycles per instruction
• So, the faster machine with cache misses has CPI = 2 + 2.752 = 4.752
• CPU time with stalls / CPU time with perfect cache
= CPI with stalls / CPI without stalls
= 4.752 / 2 = 2.376
• Performance with a perfect cache is better by a factor of 2.376 (> 1.688)
• Conclusion: with a higher clock rate, cache misses "hurt more" than with a lower clock rate (all three ratios are checked in the sketch below)

Read the Example on page-400, Patterson.
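
All three scenarios can be reproduced in a few lines (a quick check; computed per instruction, so I = 1):

```python
i_miss, d_miss, ls_freq = 0.02, 0.04, 0.36

def stalls(penalty):
    # stall cycles per instruction: instruction misses + data misses
    return i_miss * penalty + ls_freq * d_miss * penalty

print(stalls(40))                     # 1.376 stall cycles per instruction
print((2 + stalls(40)) / 2)           # 1.688  (problem 1)
print((1 + stalls(40)) / 1)           # 2.376  (problem 2, CPI = 1)
print((2 + stalls(80)) / 2)           # 2.376  (problem 3, penalty now 80)
print(stalls(40) / (2 + stalls(40)))  # 0.4075 -> 40.75% of time in stalls
```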
Average memory access time (AMAT)
• If the hit time increases (say due to increasing the cache size),
the total time to access a word from the memory system will
increase, possibly causing an increase in the processor cycle
time.
• To capture the fact that the time to access data for both hits and misses affects performance, designers sometimes use average memory access time (AMAT) as a way to examine alternative cache designs.
• Average memory access time (AMAT) is the average time to
access memory considering both hits and misses and the
frequency of different accesses:

AMAT = Time for a hit + Miss rate × Miss penalty



AMAT Example
• Find the AMAT for a processor with:
• clock cycle time = 1 ns
• miss penalty = 20 clock cycles
• miss rate = 0.05 misses per instruction, and
• cache access time = 1 clock cycle (including hit detection)
• Assume that the read and write miss penalties are the same
and ignore other write stalls.
• Soln:
AMAT (per instruction)
= Time for a hit + Miss rate × Miss penalty
= 1 + 0.05 × 20
= 2 clock cycles, or 2 ns (checked in the sketch below).
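
The same computation in code (a trivial check; the 1 ns cycle time converts cycles to time):

```python
def amat(hit_time, miss_rate, miss_penalty):
    # average memory access time, all in clock cycles
    return hit_time + miss_rate * miss_penalty

CLOCK_NS = 1.0
cycles = amat(1, 0.05, 20)
print(cycles, "cycles =", cycles * CLOCK_NS, "ns")  # 2.0 cycles = 2.0 ns
```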
Reference
[1] Patterson, D. A., & Hennessy, J. L. (2014). Computer Organization and Design: The Hardware/Software Interface (5th ed.). Burlington, MA: Morgan Kaufmann Publishers.

[2] Stallings, W. (2010). Computer Organization and Architecture (8th ed.). Upper Saddle River, NJ: Prentice Hall.

[3] Hamacher, C., Vranesic, Z., Zaky, S., & Manjikian, N. (2012). Computer Organization and Embedded Systems (6th ed.). New York, NY: McGraw-Hill.

