
CACHE MEMORY

Today's high-performance microprocessors operate at
speeds that far outpace even the fastest commonly
available memory bus architectures. One of the
biggest limitations of main memory is the wait state: a
period of time between operations during which the
processor must wait for the memory to become ready for the
next operation. The most common technique used to match
the speed of the memory system to that of the processor is
caching.

Figure 1: Processor – Memory Performance Gap [1]

Cache memory is the level of the computer memory hierarchy
situated between the processor and main memory. It is a
very fast memory that the processor can access much more
quickly than main memory (RAM). Cache is relatively small
and expensive. Its function is to keep a copy of the data
and code (instructions) currently used by the CPU. By using
cache memory, wait states are significantly reduced and
the work of the processor becomes more efficient. [2]
Figure 2: Memory Hierarchy [3]

Cache is much faster than main memory because it is
implemented using SRAM (Static Random Access Memory). The
problem with DRAM, which makes up main memory, is that its
cells are built from capacitors, which have to be
constantly refreshed in order to preserve the stored
information (they lose charge through leakage current).
Whenever data is read from a cell, the cell is refreshed,
and cells must also be refreshed periodically, typically
every 4 to 16 ms. This slows down the entire process. SRAM,
on the other hand, consists of flip-flops, which retain
their state as long as the power supply is on. (A flip-flop
is an electrical circuit composed of transistors and
resistors; see the picture.) Because of this, SRAM need not
be refreshed and is over 10 times faster than DRAM.
Flip-flops, however, require complex circuitry, which makes
SRAM much larger and more expensive, limiting its use. [4]
Figure 3: Flip – flop [5]

Principles of cache memory

In general, cache memory works by attempting to
predict which memory the processor is going to need next,
loading that memory before the processor needs it, and
saving the results after the processor is done with it.
Whenever the byte at a given memory address needs to be
read, the processor first attempts to get the data from the
cache. If the cache does not have that data, the processor
is halted while the data is loaded from main memory into
the cache. At the same time, the memory around the required
data is also loaded into the cache. [6]

Figure 4: Cache Read Operation [1]


The basic principle that cache technology is based upon is
locality of reference: programs tend to access only a
small part of the address space at a given point in time.
This notion has three underlying assumptions: temporal
locality, spatial locality and sequentiality.

Temporal locality means that referenced memory is likely to


be referenced again soon. In other words, if the program
has referred to an address it is very likely that it will
refer to it again.

Spatial locality means that memory close to the referenced


memory is likely to be referenced soon. This means that if
a program has referred to an address, it is very likely
that an address in close proximity will be referred to
soon.

Sequentiality means that the future memory access is very


likely to be in sequential order with the current access.
[7] [8] [9]
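The three kinds of locality can be seen in the address trace of even a trivial loop. The sketch below is illustrative only: the base address and trace format are made up, not taken from any real machine.

```python
# Toy illustration of locality of reference: a loop that sums an
# array touches consecutive word addresses (spatial/sequential
# locality), and repeated runs of the same loop touch the same
# addresses again (temporal locality). Addresses are hypothetical.

def reference_trace(base, n):
    """Word addresses touched by a simple sequential sum loop."""
    trace = []
    for i in range(n):
        trace.append(base + i)   # each access is adjacent to the last
    return trace

trace = reference_trace(1000, 8)
# Sequentiality: consecutive references differ by exactly one word.
assert all(b - a == 1 for a, b in zip(trace, trace[1:]))
# Temporal locality: running the loop again repeats the same addresses.
assert reference_trace(1000, 8) == trace
```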

How does cache memory work?

Main memory consists of up to 2^n addressable words, with
each word having a unique n-bit address. For mapping
purposes, this memory is considered to consist of a number
of fixed-length blocks of K words each. Thus, there are
M = 2^n / K blocks. The cache is split into C lines of K
words each, with the number of lines considerably smaller
than the number of main memory blocks (C << M). At any
time, only a subset of the blocks of main memory resides in
the lines of the cache. If a word in a block of memory is
read, that block is transferred to one of the lines of the
cache. Since there are more blocks than lines, an
individual line cannot be uniquely and permanently
dedicated to a particular block. Therefore, each line
includes a tag that identifies which particular block of
main memory currently occupies that line of cache. The
tag is usually a portion (a number of bits) of the main
memory address. [1]
Figure 5: Cache / Main – Memory
Structure [1]
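The block arithmetic above can be sketched directly. This is a minimal illustration assuming word-addressable memory and a power-of-two block size; the parameter values are examples, not figures from the text.

```python
# Sketch of the structure described above: n-bit addresses give 2**n
# words, divided into M = 2**n / K blocks of K words each.

def block_count(n_bits, k_words):
    """Number of main memory blocks, M = 2**n / K."""
    return (2 ** n_bits) // k_words

def block_number(address, k_words):
    """Block a word address falls in, and its offset inside the block."""
    return address // k_words, address % k_words

M = block_count(16, 4)               # 2**16 words, 4-word blocks
assert M == 16384
assert block_number(13, 4) == (3, 1)  # word 13 is word 1 of block 3
```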

Cache effectiveness:

The effectiveness of the cache is determined by the number
of times the cache successfully provides the required data.
This is the place to introduce some terminology. A hit is
the term used when the data required by the processor is
found in the cache. If the data is not found, we have a
cache miss. There are three types of misses:

• Compulsory misses – the first access to a block is not in
the cache, so the block must be brought into the cache.
These are also called cold-start misses or first-reference
misses.
• Capacity misses – if the cache cannot contain all the
blocks needed during execution of a program, capacity
misses will occur due to blocks being discarded and later
retrieved. These are misses due to cache size.
• Conflict misses – if the block placement strategy is set
associative or direct mapped, conflict misses will occur
because a block can be discarded and later retrieved if too
many blocks map to its set. These are also called collision
misses or interference misses. [10]

The time it takes to access the cache on a hit is the hit
time. If, however, the CPU does not find the required data
in the uppermost level of the memory hierarchy (in other
words, when there is a miss), the data is fetched from the
lower levels of memory and replicated into the uppermost
level. The time needed to fetch the block from the lower
levels, replicate it and then access it is called the miss
penalty. The hit ratio is the percentage of hits out of
total memory accesses; the miss ratio is equal to
(1 - hit ratio). One way to improve cache effectiveness is
to reduce the number of misses and increase the number of
hits; in other words, we strive for a high hit rate and a
low miss rate. [11] [8]
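These quantities combine into the standard average memory access time formula, AMAT = hit time + miss ratio x miss penalty. The cycle counts below are made-up illustrative values, not measurements from the text.

```python
# Average memory access time from the terminology defined above:
# AMAT = hit_time + (1 - hit_ratio) * miss_penalty.

def amat(hit_time, hit_ratio, miss_penalty):
    miss_ratio = 1.0 - hit_ratio          # miss ratio = 1 - hit ratio
    return hit_time + miss_ratio * miss_penalty

# e.g. a 1-cycle hit, 95% hit ratio, 100-cycle miss penalty:
assert abs(amat(1, 0.95, 100) - 6.0) < 1e-9
# With a perfect hit ratio, every access runs at cache speed:
assert amat(1, 1.0, 100) == 1.0
```

Note how strongly the miss penalty dominates: raising the hit ratio from 95% to 99% here cuts the average access time from 6 cycles to 2.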

Figure 6: Cache Memory Example [12]

In order to improve cache performance we need to
increase the hit ratio. Many different techniques have been
developed; the following are the most commonly used ones.
Figure 7: Elements of Cache Design [1]

Cache Size:

One way to decrease the miss rate is to increase the
cache size. The larger the cache, the more information it
will hold, and the more likely a processor request will
produce a cache hit, because fewer memory locations are
forced to share the same cache line. However, applications
benefit from increasing the cache size only up to a point,
after which performance stops improving as the cache size
increases. When this point is reached, one of two things
has happened: either the cache is large enough that the
application almost never has to retrieve information from
disk, or the application is doing truly random accesses,
and therefore increasing the size of the cache does not
significantly increase the odds of finding the next
requested information in the cache. The latter is fairly
rare; almost all applications show some form of locality
of reference.

In general, the cache has to be small enough that
the overall average "cost per bit" is close to that of main
memory alone, and large enough that the overall average
access time is close to that of the cache alone. There are
several other reasons to minimize the size of the cache.
The larger the cache, the larger the number of gates
involved in addressing it. The result is that large
caches tend to be slightly slower than small ones, even
when built with the same integrated circuit technology and
placed in the same location on the chip. Cache size is also
limited by the available space on the chip.

Typical cache sizes today range from 1K words to
several mega-words. Multitasking systems tend to favor
256K words as a nearly optimal size for a main memory
cache. The performance of the cache is very sensitive to
the nature of the workload, making it impossible to arrive
at a single "optimum" cache size. [13]

Mapping function:

The mapping function gives the correspondence between


main memory blocks and cache lines. Since each cache line
is shared between several blocks of main memory (the number
of memory blocks >> the number of cache lines), when one
block is read in, another one should be moved out. Mapping
functions minimize the probability that a moved-out block
will be referenced again in the near future. There are
three types of mapping functions: direct, fully
associative, and n-way set associative.

Direct Mapping: In a direct-mapped cache, each main memory
block is assigned to a specific line in the cache according
to the formula i = j modulo C, where i is the cache line
number assigned to main memory block j, and C is the number
of lines in the cache. A direct-mapped cache interprets a
main memory address (comprising s + w bits) as 3 distinct
fields.
Figure 8: Direct Mapped Cache Address [1]

» Tag (the most significant s - r bits) – a unique
identifier for each memory block. It is stored in the
cache along with the data words of the line, and identifies
one of the 2^s blocks of main memory.
» Line number (middle r bits) – specifies which cache line
will hold the referenced address; it identifies one of the
C = 2^r lines of the cache.
» Word (least significant w bits) – specifies the offset of
a byte within a line or block, i.e. the specific byte in a
cache line that is to be accessed. [1]

Figure 9: Direct Mapped Cache Access[1]

For every memory reference that the CPU makes, the tag of
the cache line that is supposed to hold the required block
is checked to see if the correct block is in the cache.
Since no two blocks that map into the same line have the
same tag, the cache controller can determine if we have a
cache hit or a cache miss.
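The address split described above is simple bit manipulation. The field widths below (a 2-bit word offset and a 3-bit line number) are illustrative choices, not values from the text.

```python
# Sketch of the direct-mapped address split: for an (s + w)-bit
# address, the low w bits are the word offset, the next r bits
# select the cache line (block j maps to line j mod 2**r), and
# the remaining high bits are the tag.

def split_direct(address, w, r):
    word = address & ((1 << w) - 1)        # least significant w bits
    line = (address >> w) & ((1 << r) - 1) # middle r bits
    tag = address >> (w + r)               # most significant bits
    return tag, line, word

# 2-bit word offset, 3-bit line number (an 8-line cache):
tag, line, word = split_direct(0b10110_101_11, 2, 3)
assert (tag, line, word) == (0b10110, 0b101, 0b11)
```

A hit occurs when the tag stored in line 0b101 equals 0b10110; since no two blocks mapping to the same line share a tag, this single comparison decides hit or miss.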

Direct mapped cache is the simplest form of cache. It is


easy and relatively inexpensive to implement, and
determining whether a main memory block can be found in
cache is simple and quicker than with other mapping
functions. However, it has one significant disadvantage -
each main memory block is mapped to a specific cache line.
Through locality of reference, it is possible to repeatedly
reference blocks that map to the same line number. These
blocks will be constantly swapped in and out of cache,
causing the hit ratio to be low. Although such swapping is
rare in a single-tasking system, in multi-tasking systems
it can occur quite often and thus slow down a direct-mapped
cache. In general the performance is worst for this type of
mapping.

Fully Associative Mapping: A fully associative cache allows
each block of main memory to be stored in any line of the
cache. The main memory address (s + w bits) is divided into
two distinct fields: tag (s bits) and word offset (w bits).
[1]

Figure 10: Fully Associative


Cache Address[1]

The cache controller must now check the tag of every line
in the cache to determine if a given memory address is
currently resident in the cache. This is not done using
common searching techniques but rather through the use of
associative memory (also called Content Addressable Memory
(CAM)). CAM basically allows the entire tag memory to be
searched in parallel. Unlike typical RAM, CAM associates
logic with each memory cell in the memory. This allows for
the contents of all memory cells in the CAM to be checked
in a single clock cycle. Thus access to the CAM is based
upon content not address as with ordinary RAM. CAMs are
considerably more expensive in terms of gates than ordinary
access by address memories (RAMs) and this limits their use
(at least with current technologies) to relatively small
memory systems such as cache.

Figure 11: Fully Associative


Cache Access[1]

The cache controller must be able to uniquely identify
every main memory block, which requires s bits since there
are 2^s blocks in main memory. As before, w bits are
required to uniquely identify a particular byte within a
specific block. Since any cache line can hold any main
memory block at any time, the cache controller needs to
perform a simultaneous search over all tags to determine
whether the desired line is in the cache. This is where
the CAM comes into play: the entire contents of the cache
can be compared simultaneously. A cache hit is indicated
when one of the CAM cells' contents matches the search
address. If none of the CAM cells' contents matches the
search address, a cache miss has occurred and the proper
block from main memory will need to be loaded into one of
the cache lines. At this point the cache controller
invokes the line replacement algorithm, which selects the
cache line to be replaced to make room for the new incoming
block. [13]
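Functionally, the CAM lookup behaves like the sketch below. A real CAM compares every cell in a single clock cycle; the sequential scan here only models the outcome, and the tag values are illustrative.

```python
# Sketch of a fully associative lookup: every resident line's tag is
# compared against the requested tag, and any match is a hit. The
# scan models what CAM hardware does in parallel in one cycle.

def fully_assoc_lookup(resident_tags, tag):
    matches = [i for i, t in enumerate(resident_tags) if t == tag]
    return matches[0] if matches else None  # hit line, or None on a miss

lines = [0x1A, 0x3F, 0x07, 0x2C]            # tags currently in the cache
assert fully_assoc_lookup(lines, 0x07) == 2     # hit in line 2
assert fully_assoc_lookup(lines, 0x55) is None  # miss: replacement needed
```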

Fully associative mapping is very flexible and overcomes


direct mapping’s main weakness. The fully associative cache
has the best hit ratio because any line in the cache can
hold any address that needs to be cached. However this
cache suffers from problems involving searching the cache.
Even with specialized hardware to do the searching, a
performance penalty is incurred. And this penalty occurs
for all accesses to memory, whether a cache hit occurs or
not. In addition, more logic must be added to determine
which of the various lines to use when a new entry must be
added (usually some form of a "least recently used"
algorithm is employed to decide which cache line to use
next). All this overhead adds cost, complexity and
execution time. That is why, fully associative cache is
very rarely used.

Set Associative Cache is basically a good compromise


between direct mapping and fully associative mapping. It
builds on the strengths of both: namely, the easy control
of the direct mapped cache and the more flexible mapping of
the fully associative cache.

In set associative mapping, the cache is divided into a
number of smaller direct-mapped areas called sets (v), with
each set holding a number of lines (k). The cache is then
described by the number of lines each set contains: if a
set can hold X lines, the cache is referred to as an X-way
set associative cache. A main memory block can be stored in
any one of the k lines of the set given by cache set number
= j modulo v (where j is the main memory block number).

A set-associative cache interprets a main memory address
(comprising s + w bits) as 3 distinct fields: Tag (s - d
bits), Set (d bits), and Word (w bits). [1]
Figure 12: Set-Associative Cache Address [1]

Figure 13: Set-Associative Cache Access [1]

For each main memory group, the cache is capable of holding
k different main memory blocks simultaneously. A four-way
set associative cache can hold up to four main memory
blocks from the same group at once, an eight-way set
associative cache up to eight, and so on.

Whenever the CPU issues a block request, the cache
controller for a k-way set associative cache needs to
check the contents of the k cache lines of the particular
set to which the required block belongs. Once again, the
use of CAM allows the contents of all these cache lines to
be checked in parallel. If one of the k cache lines
contains the requested block, a cache hit occurs;
otherwise a cache miss occurs, and one of the k cache
lines belonging to that set will need to be selected for
replacement. In general, set associative mapping has a
better hit ratio than direct mapping and is faster than
fully associative mapping, therefore providing better
overall performance than the other two mapping functions.
Two-way and four-way set associative caches give the best
trade-off between hit ratio and search speed. [13] [14]
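The set selection and k-way lookup can be sketched as follows. The cache contents are illustrative, and the linear scan within a set stands in for the parallel tag comparison the CAM performs.

```python
# Sketch of k-way set-associative lookup: block j maps to set
# j mod v, and the k tags resident in that set are compared
# (in parallel, in hardware) against the requested block.

def set_assoc_lookup(cache, block_number, v):
    """cache: list of v sets, each a list of resident block numbers."""
    s = block_number % v              # set index: j modulo v
    return block_number in cache[s]   # parallel tag compare, modeled as a scan

# 4 sets, 2-way: each set holds up to 2 blocks.
cache = [[0, 4], [1, 9], [], [7, 11]]
assert set_assoc_lookup(cache, 9, 4) is True    # block 9 -> set 1: hit
assert set_assoc_lookup(cache, 5, 4) is False   # block 5 -> set 1: miss
```

With v = 1 this degenerates to fully associative mapping, and with one line per set it degenerates to direct mapping, which is why set associativity sits between the two.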

Summary: In the "real world", the direct mapped and set


associative caches are by far the most common. Direct
mapping is used more for level 2 caches on motherboards,
while the higher-performance set-associative cache is found
more commonly on the smaller primary caches contained
within processors.

Cache Type                  Hit Ratio            Search Speed

Direct Mapped               Good                 Best

Fully Associative           Best                 Moderate

N-Way Set                   Very Good, Better    Good, Worse as
Associative, N>1            as N Increases       N Increases

Figure 14: Mapping Function Comparison Table [14]

Cache Line Replacement Algorithms:

When a new line is loaded into the cache, one of the


existing lines must be replaced. In a direct mapped cache,
the requested block can go in exactly one position, and the
block occupying that position must be replaced. In an
associative cache we have a choice of where to place the
requested block, and hence a choice of which block to
replace. In a fully associative cache, all blocks are
candidates for replacement. In a set associative cache, we
must choose among the blocks in the selected set. Therefore
a line replacement algorithm is needed that sets up
well-defined criteria upon which the replacement is made.
A large number of algorithms are possible and many have
been implemented. Four of the most common cache line
replacement algorithms are:

• Least Recently Used (LRU) – the cache line that was
last referenced in the most distant past is replaced.
• FIFO (First In, First Out) – the cache line from the
set that was loaded in the most distant past is
replaced.
• LFU (Least Frequently Used) – the cache line that has
been referenced the fewest number of times is
replaced.
• Random – a randomly selected line from the cache is
replaced.

The most commonly used algorithm is LRU. LRU replacement is
implemented by keeping track of when each element in a set
was used relative to the other elements in the set. For a
two-way set associative cache, tracking when the two lines
were used can easily be implemented in hardware by adding a
single bit (use bit) to each cache line. Whenever a cache
line is referenced, its use bit is set to 1 and the use bit
of the other cache line in the same set is set to 0. The
line selected for replacement at any given time is the
line whose use bit is currently 0. Because the principle of
locality of reference means that a recently used cache line
is more likely to be referenced again, LRU tends to give
the best performance. In practice, as associativity
increases, LRU becomes too costly to implement, since
tracking the usage information is expensive. Even for
four-way set associativity, LRU is often approximated –
for example, by keeping track of which of a pair of blocks
is LRU (which requires one bit), and then tracking which
line in each pair is LRU (one bit per pair). For large
associativity, LRU is either approximated or random
replacement is used.
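The two-way use-bit scheme described above can be sketched in a few lines. This is a simplification: a real controller updates the bits and selects the victim in the same cycle.

```python
# Sketch of two-way LRU via use bits: referencing a line sets its
# bit to 1 and clears the other line's bit; the victim is the
# line whose use bit is 0.

def touch(use_bits, line):
    """Mark `line` (0 or 1) as most recently used within its set."""
    use_bits[line] = 1
    use_bits[1 - line] = 0

def victim(use_bits):
    """Line to replace: the one whose use bit is currently 0."""
    return use_bits.index(0)

bits = [0, 0]
touch(bits, 0)           # reference line 0
assert victim(bits) == 1 # line 1 is now the LRU candidate
touch(bits, 1)           # reference line 1
assert victim(bits) == 0 # the roles have swapped
```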

The FIFO replacement policy is also easily implemented in
hardware, by treating the cache lines as queues. The LFU
replacement algorithm is implemented by associating with
each cache line a counter that is incremented on every
reference to the line. Whenever a line needs to be
replaced, the line with the smallest counter value is
selected, as it is the cache line that has experienced the
fewest references.

Random replacement is simple to build in hardware. While it
may seem that this algorithm would be a poor replacement
line selection method, in reality it performs only slightly
worse than the other three algorithms mentioned. For a
two-way set associative cache, random replacement has a
miss rate only about 1.1 times higher than LRU replacement.
The reason for this is easy to see: since there are only
two cache lines per set, any replacement algorithm must
select one of the two, so random selection has a 50-50
chance of selecting the same line that the LRU algorithm
would select, yet the random algorithm has no overhead
(there is no use bit). As caches become larger, the miss
rate for both replacement strategies falls, and the
absolute difference becomes small. In fact, random
replacement is sometimes better than the simple LRU
approximations that can be easily implemented in hardware.
[13]

Cache Write Policies:

Before a cache line can be replaced, it is necessary to


determine if the line has been modified. The contents of
the main memory block and the cache line that corresponds
to that block are essentially copies of each other and
should therefore hold the same data. If cache line X has
not been modified since its arrival in the cache, updating
the main memory block it corresponds to is not required
prior to its replacement. The incoming cache line can
simply overwrite the existing cache memory. On the other
hand, if the cache line has been modified, at least one
write operation has been performed on the cache line and
thus the corresponding main memory block must be updated.
Basically there are two different policies that can be
employed to ensure that the cache and main memory contents
are the same: write-through and write-back.

Write-through: Assuming a cache hit (a write hit), the
information is written immediately to both the line in the
cache and the block in the lower-level memory (with its
normal wait-state delays). The advantages of this technique
are that the contents of main memory and the cache are
always consistent, it is easy to implement, and a read miss
never results in writes to main memory. On the other hand,
the write-through policy has a significant drawback: every
write needs a main memory access, which is slower and
results in greater memory bandwidth usage. In spite of
this, most Intel microprocessors use a write-through cache
design.

Write-back (sometimes called a posted-write or copy-back
cache): On a cache hit, the information is written only to
the line in the cache, which allows the processor to
resume processing immediately. The modified cache line is
written to main memory only when it is replaced. To reduce
the frequency of writing back blocks on replacement, a
dirty bit is commonly used. This status bit indicates
whether the block is dirty (modified while in the cache) or
clean (not modified). If it is clean, the block is not
written back on a miss. The advantages of the write-back
policy are that writes occur at the speed of the cache
memory, and multiple writes within a block require only one
write to main memory, which results in less memory
bandwidth usage.

Write-back is a faster alternative to the write-through


policy but it has one major disadvantage - the contents of
the cache and the main memory are not consistent. This is
the cache coherency problem and it is an active research
topic. The cache coherency becomes an issue, for example,
when a hard disk is read and information is transferred
into the main memory through the DMA (Direct Memory Access)
system, which does not involve the processor. The cache
controller must constantly monitor the changes made in the
main memory and ensure that the contents of the cache
properly track these changes to the main memory. There are
many techniques that have been employed to allow the cache
controller to "snoop" the memory system, but once again,
these add complexity and expense. In the PC
environment there are special cache controller chips that
can be added which basically handle all the
responsibilities for supervising the cache system. The
cache coherency problem becomes acute in a bus architecture
in which more than one device (typically processors) has a
cache and main memory is shared.

Writes introduce yet another complication not present in
reads. If a write operation has been requested and a cache
miss results, one of two options for handling the write
miss can be employed. The line may be brought into the
cache and then updated, which is termed a write-allocate
policy, or the block may be updated directly in main
memory and not brought into the cache, which is termed a
write-no-allocate policy. Typically, write-through caches
employ a write-no-allocate policy while write-back caches
utilize a write-allocate policy.

Write hit policy        Write miss policy

Write Through           Write Allocate
Write Through           No Write Allocate
Write Back              Write Allocate
Write Back              No Write Allocate

Figure 15: Possible Combinations on write [15]

Write Through with Write Allocate: on a hit it writes to
both the cache and main memory; on a miss it updates the
block in main memory and brings the block into the cache.
Bringing the block into the cache on a miss does not make
much sense in this combination, because the next hit to
this block will generate a write to main memory anyway
(according to the write-through policy).

Write Through with No Write Allocate: on a hit it writes to
both the cache and main memory; on a miss it updates the
block in main memory without bringing the block into the
cache. Subsequent writes to the block will update main
memory anyway, because the write-through policy is
employed, so some time is saved by not bringing the block
into the cache on a miss.

Write Back with Write Allocate: on a hit it writes to the
cache, setting the dirty bit for the block; main memory is
not updated. On a miss it updates the block in main memory
and brings the block into the cache. Subsequent writes to
the same block, if the block originally caused a miss, will
then hit in the cache, setting the dirty bit. This
eliminates extra memory accesses and results in very
efficient execution compared with the Write Through with
Write Allocate combination.

Write Back with No Write Allocate: on a hit it writes to
the cache, setting the dirty bit for the block; main memory
is not updated. On a miss it updates the block in main
memory without bringing the block into the cache.
Subsequent writes to the same block, if the block
originally caused a miss, will generate misses all the way
and result in very inefficient execution. [15]

Block/Line Size

Another element in the design of a cache system is the
line size: the number of bytes per cache line, sometimes
also referred to as the block size. When a block of data
is retrieved from main memory and placed in the cache, not
only the requested word but also some number of adjacent
words (those in the same block) are loaded. As the block
size increases from very small to larger sizes, the hit
ratio will at first increase because of the principle of
locality. However, as the block becomes even bigger, the
hit ratio will begin to decrease, because the probability
of using the newly fetched information becomes less than
the probability of reusing the information that has been
replaced. Two specific effects need to be considered:

Larger blocks reduce the number of blocks that will fit


into the cache. Because each fetch overwrites older cache
contents, a small number of blocks in the cache will result
in data being overwritten shortly after it has been loaded.

As a block becomes larger, each additional word is farther


from the requested word, and is therefore less likely to be
needed in the near future (principle of locality).

The relationship between block size and the hit ratio is


complex and depends heavily on the locality characteristics
of a particular program. No definitive optimum value has
been found. A size from two to eight words seems, in
practice, to work reasonably close to optimum. [13]

Number of Caches

When cache systems were first introduced, the typical


system had a single cache. More recently, the use of
multiple caches has become popular. There are two aspects
to the number of caches that are important, namely the
number of levels and whether the cache is unified or split.

Cache Levels: By adding several levels of cache between the


original cache and the processor, we can have caches that
are close enough to the processor so that they can match
its clock cycle time.

L1: Level 1 cache is often called the primary cache. The L1
cache, located within the processor core (in both AMD and
Intel designs), is usually the smallest cache and has a
very low latency, as it is used extensively for all sorts
of purposes, such as data fetching, data shifting and data
loops, storing only small amounts of data. It is reachable
by the processor without external bus activity and
therefore contributes to execution speed-up and
minimization of bus traffic. L1 cache operates at the same
clock speed as the processor and is usually implemented as
a split cache (separate parts for instructions and data,
explained later). Current microprocessors have L1 caches
that range in size from 16 KB to 128 KB.

L2: Level 2 cache is often called the secondary cache. L2
cache stores much more data, usually coming from the L1
cache, in multiples of the L1 cache size. Previously L2
caches were external to the processor, but modern
processors integrate them directly on-chip. This trend
began with the Pentium Pro and has continued with Intel's
line of microprocessors. The biggest advantage of moving
the L2 cache on-chip and running it at full clock speed is
that the L1 and L2 caches can run in parallel and be
accessed concurrently, reducing latency. This is a good way
to improve cache performance, but it also has a couple of
disadvantages. One is that tuning for higher clock speeds
with a reduction in latency also reduces the potential
cache memory clock speed. A frequent L2 cache configuration
is between four and sixteen times the size of the L1,
depending on configuration and desired latency. L2 is
generally 256 or 512 KB.

L3: Newer microprocessors such as AMD K6-3, K7 and Intel’s


Pentium 3 designs offer three levels of caching. Level 3
caches are typically external to the microprocessor and
operate at the same speed as main memory, but are fast
enough to not impose wait states on the processors. [13]
Figure 16: Typical cache
configurations

While it is very difficult to determine quantitatively the


improvement a two-level cache represents compared to a
single level cache, studies have shown that, in general,
the second level does provide a performance improvement.

Unified vs. Split Caches: When Level 1 cache first


appeared, most designs consisted of a single cache that
held both data and instructions at the same time. This is
called a unified cache (Von Neumann Architecture). However,
contemporary systems usually split the L1 cache into two
separate caches (both still considered L1 caches): one
dedicated to instructions for the processor and the other
dedicated to the data on which the instructions will
operate. This approach is called a split cache (Harvard
architecture).

There are several potential advantages of a unified cache.
For a given cache size, a unified cache is more flexible
and has a higher hit rate than split caches, because it
automatically balances the load between instruction and
data fetches. That is, if a program's execution requires
many more instruction fetches than data fetches, the cache
will tend to fill up with instructions, increasing the
probability that a cache hit will occur. If the execution
pattern, however, consists of more data fetches than
instruction fetches, the cache will tend to fill with data,
which again increases the probability that a cache hit will
occur. A further advantage of a unified cache is that only
one cache needs to be designed and implemented.

The key advantage provided by the split cache is that cache


contention between the instruction processor and the
execution units is eliminated, which allows simultaneous
data and instruction access, resulting in higher bandwidth.
Although it reduces hit rate, this is not catastrophic if
L2 cache is available to keep miss penalties low. This is
very important in any system that supports pipelining and
parallel execution of instructions. If a unified cache is
used in such systems, when a simultaneous request occurs
for a memory load/store access and instruction prefetch,
the prefetch request will be blocked so that the unified
cache can service the execution unit first, enabling it to
complete the currently executing instruction. This cache
contention can degrade performance by interfering with
efficient use of the instruction pipeline. The split cache
structure overcomes this difficulty and provides more
bandwidth. [16]

Nowadays, the trend is toward using split L1 combined with
unified L2 for good aggregate performance, particularly in
superscalar machines such as the Pentiums and Athlons. In
the case of the Pentium, the L1 cache is split into two
caches called the I-cache (for Instruction-cache) and the
D-cache (for Data-cache). RISC machines that are based on
the Harvard architecture split the caches based upon the
assumption that the operating system separates code and
data in main memory. The Pentium processors do not make
this assumption; therefore no attempt is made to separate
the instructions from the data and both will appear in both
the I cache and the D cache. The reason for splitting the
cache in the Pentium is solely to eliminate the cache
contention problem.
Figure 17: Comparison of Miss Rates [3]
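The trade-off between one unified cache and an equal-sized split pair can be sketched with a toy direct-mapped model. Everything below (the 8-set/4-set sizes, the instruction-heavy trace, the `misses` helper) is an invented illustration, not a measurement of any real processor:

```python
# Toy miss-count comparison: one unified direct-mapped cache of
# 8 sets versus a split cache built from two half-size (4-set)
# halves, one for instructions and one for data. Sizes and the
# access trace are illustrative assumptions only.

def misses(blocks, num_sets):
    cache, miss = [None] * num_sets, 0
    for block in blocks:
        idx = block % num_sets
        if cache[idx] != block:     # direct-mapped: one candidate slot
            miss += 1
            cache[idx] = block
    return miss

# interleaved accesses: an instruction-heavy loop plus two data blocks
trace = [("I", i % 6) for i in range(12)] + [("D", 100), ("D", 101)]

unified = misses([b for _, b in trace], num_sets=8)
split = (misses([b for k, b in trace if k == "I"], num_sets=4) +
         misses([b for k, b in trace if k == "D"], num_sets=4))
print(unified, split)   # → 8 12: the unified cache absorbs the skew
```

Here the six-block instruction working set overflows the half-size I-cache while the unified cache balances the skewed mix automatically; a data-heavy trace would shift the numbers, and the split design still wins on bandwidth because both halves can be probed in the same cycle.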

Advanced Cache Techniques

There have been many different techniques used to enhance
the performance of cache memories. Many different hybrid
organizations have been proposed and some implemented.
Sector mapping is one that is fairly commonly used and is
based upon the set associative mapping but rather than
mapping main memory blocks, main memory sectors are mapped
(this organization tends to be more closely tied to the
hardware than to the OS). Another of these techniques is
to utilize special memory for the cache. Highlights of
this technique are discussed below.

Burst Mode Caches

High performance microprocessors require the fastest
possible caches to minimize waiting. One technique for
speeding up the cache is to operate in a burst mode. Just
like in main memory, burst mode in a cache eliminates the
need to send a separate address for each memory read or
write operation. Instead, the cache reads or writes a
contiguous sequence of addresses in a quick burst.
Depending upon whether the system is reading or writing,
operation in burst mode can cut cache access time by just over
50%. The largest improvement occurs when performing write
operations in a burst. Ordinary static RAM chips do not
have fast enough response times to support cache operations
in burst mode. As a result, three special types of SRAM
have been developed to support burst mode cache operations.
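The shape of that saving can be shown with a back-of-the-envelope timing model; the two-cycle address cost and one-cycle word transfer below are assumed values chosen only to make the arithmetic visible:

```python
# Compare ordinary and burst-mode reads of a cache line.
# Assumed (illustrative) costs: sending an address = 2 cycles,
# transferring one word = 1 cycle.

ADDR_CYCLES, XFER_CYCLES = 2, 1

def ordinary(words):
    # the processor resends the address for every single word
    return words * (ADDR_CYCLES + XFER_CYCLES)

def burst(words):
    # one address is sent, then the chip streams consecutive words
    return ADDR_CYCLES + words * XFER_CYCLES

print(ordinary(4), burst(4))   # → 12 6: the burst halves the time
```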

Asynchronous SRAM: The RAM is called asynchronous because
the processor has to provide an address for each cache
access and then has to wait. This is an obsolete type of
cache, found in 386 and 486 machines only.
Synchronous Burst SRAM: This type uses an internal clock to count up
to each new address after each memory operation. Since the
chip automatically increments the address, it doesn't
require the processor to send it the next address. Since
this type of cache chip runs at the same speed as the
processor, the timing constraints on the chip are critical
for fast, error-free operation. This causes the
Synchronous Burst SRAM to be inherently more expensive than
conventional asynchronous cache designs.

Pipelined Burst SRAM: The very tight timing constraints of
Synchronous Burst SRAM not only make it more expensive but
also more difficult to manufacture. Pipelined Burst SRAM
achieves the same level of performance but without the need
for a synchronous internal clock. This type of cache chip
includes an internal register that holds the next chunk of
data in the sequence to be read. While the register holds
this value, the chip continues to run internally and
accesses the next address to load the pipeline. As soon as
the processor reads the output register, the pipeline can
unload the data from the next address and place this data
into the register to be ready for the next read operation.
Since the pipeline in the chip keeps a supply of data
always ready, this form of memory can run as fast as the
processor requires data, limited only by the access time of
the pipeline register. [13]

Victim Cache: Higher associativity and larger block size
are two classic techniques to reduce miss rates. There are
however more recent methods that aim to reduce cache miss
rate without affecting the clock cycle or the miss penalty.
One solution is to add a small, fully associative cache
between a cache and its refill path. This cache (also known
as victim cache) contains only blocks that are discarded
from a cache because of a miss ("victims"). They are checked
on a miss to see if they have the desired data before going
to the next lower-level memory. If the data is found there,
the victim block and cache block are swapped. Otherwise the
new "victim" is put into the victim cache. Misses served by
the victim cache have only a very small miss penalty,
typically one cycle, as opposed to several cycles for main
memory. "Victim caches" tend to increase effective
associativity of a small direct mapped cache, which results
in reduction of conflict misses. They are less effective
for large cache sizes. Victim caches are not widely used as
modern processors have large caches, and large victim
caches are extremely difficult to build. [16]
Figure 18: Victim Cache [17]

Pseudo-associative: Another approach, called pseudo-associative
or column associative, aims to combine the miss rate of
set-associative caches with the hit speed of a direct-mapped
cache.
When a hit occurs the cache access proceeds just as in the
direct-mapped cache. On a miss, however, before going to
the next lower level of the memory hierarchy, another cache
entry is checked to see if it matches there. A typical way
of finding the next block in the "pseudo set" to probe is
to invert the most significant bit of the index field.
Pseudo-associative caches thus have one fast and one slow hit
time, corresponding to a regular hit and a pseudo hit. A
problem with pseudo-associative caches may arise if many of
the fast hit times of the direct-mapped cache become slow hit
times in the pseudo-associative cache; then the performance
would be reduced. Therefore it is important to indicate for
each set which block should be the fast hit and which
should be the slow one; one way is simply to swap the
contents of the blocks. Pseudo-associative caches improve
miss rates and reduce the effect of conflict misses without
affecting the processor clock rate. [18]
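The index-bit inversion and the swap can be sketched as follows; the 8-set size, the fill policy that moves a displaced block to the alternate slot, and the hit labels are assumptions made for illustration:

```python
# Sketch of a pseudo-associative (column-associative) lookup: a
# direct-mapped array where a miss triggers a second probe at the
# index with its most significant bit inverted.

NUM_SETS = 8
MSB = NUM_SETS >> 1        # mask for the top index bit (here 4)
cache = [None] * NUM_SETS

def access(block):
    idx = block % NUM_SETS
    if cache[idx] == block:
        return "fast hit"              # regular direct-mapped hit
    alt = idx ^ MSB                    # invert the MSB of the index
    if cache[alt] == block:
        # swap so the next access to this block is the fast hit
        cache[idx], cache[alt] = cache[alt], cache[idx]
        return "slow hit"
    # miss: the displaced block moves to the alternate slot
    cache[alt] = cache[idx]
    cache[idx] = block
    return "miss"

# blocks 1 and 9 conflict in a plain direct-mapped cache (both map
# to index 1); here both stay resident in the "pseudo set"
print([access(b) for b in [1, 9, 1, 9]])
# → ['miss', 'miss', 'slow hit', 'slow hit']
```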
Prefetching: Victim caches and pseudo-associativity both
promise to improve miss rates without affecting the
processor clock rate. Prefetching is another technique that
predicts soon-to-be used instructions or data and loads
them into the cache before they are accessed. Subsequently,
when the prefetched instructions or data are accessed there
is a cache hit, rather than a miss.

A commonly used method is sequential prefetching, where it
is predicted that data or instructions immediately
following those currently accessed will be needed in the
near future and are prefetched. Sequential prefetching
fetches a memory block that caused a cache miss, along with
the next n consecutive cache blocks. Sequential prefetching
can be more or less "aggressive" depending on how far ahead
in the access stream it attempts to run - how large n is.
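A toy model makes the effect of the prefetch degree n visible. Representing the cache as an unbounded set and using a purely streaming trace are simplifying assumptions that isolate the prefetching behavior:

```python
# Minimal model of degree-n sequential prefetching: on a miss,
# fetch the missing block plus the next n consecutive blocks.

def run(trace, degree):
    cache, stats = set(), {"hit": 0, "miss": 0}
    for block in trace:
        if block in cache:
            stats["hit"] += 1          # includes hits on prefetched blocks
        else:
            stats["miss"] += 1
            # demand block plus `degree` sequential neighbours
            cache.update(range(block, block + degree + 1))
    return stats

sequential = list(range(16))           # streaming access pattern
print(run(sequential, degree=0))       # → {'hit': 0, 'miss': 16}
print(run(sequential, degree=3))       # → {'hit': 12, 'miss': 4}
```

On a sequential trace a higher degree converts most misses into hits; a random trace would instead show the cache-pollution and bandwidth costs discussed below.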

Both instructions and data can be prefetched, either
directly into the caches or into an external buffer that
can be more quickly accessed than main memory. An example
of this is the Alpha AXP 21064 processor, which fetches two
blocks on a miss: the requested block and the next
consecutive block. The requested block is placed in the
instruction cache when it returns, and the prefetched block
is placed into the instruction stream buffer. If the
requested block is present in the instruction stream
buffer, the original cache request is cancelled, the block
is read from the stream buffer and the next prefetch
request is issued.

Although usually beneficial, prefetching, especially
aggressive prefetching, may reduce performance. In some
cases prefetched data or instructions may not actually be
used, but will still consume system resources - memory
bandwidth and cache space. It is possible that a prefetch
will result in useless instructions or data replacing other
instructions or data in the cache that will soon be needed.
This effect is referred to as cache pollution. The effects
of cache pollution most often increase as prefetching
becomes more aggressive. Another prevalent effect of
aggressive prefetching is bus contention. Bus contention
occurs when multiple memory accesses have to compete to
transfer data on the system bus. This effect can create a
scenario where a demand-fetch is forced to wait for the
completion of a useless prefetch, further increasing the
number of cycles the processor is kept idle.
On average, prefetching benefits performance but still may
not be optimal for any particular program. There is a
software alternative to hardware prefetching called
compiler-controlled prefetching - the compiler requests the
data before it is needed. This technique can improve memory
access time significantly if the compiler is well
implemented and can avoid references that are likely to be
cache misses. [19]

Write Buffer: Memory write schemes can be improved by
incorporating a write buffer either between the cache and
main memory or between the L1 and L2 caches. The processor
writes data into the cache and the write buffer, and after
that the memory controller writes the contents of the buffer
to memory. When used with write through, this technique
reduces processor stalls on consecutive writes, allowing the
cache to be accessed while several memory write operations
proceed. With a write-back cache, the write buffer can hold
a block during a multi-cycle write to memory, so the cache
can be used while waiting for the write when the block size
is bigger than the transfer size. [20]
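The stall saving can be sketched with a deliberately simplified model. Assume (for illustration only) that an unbuffered memory write stalls the processor for 10 cycles, while a buffered write is free as long as a FIFO slot is available:

```python
MEM_WRITE = 10   # assumed cycles for a write to reach memory

def stall_cycles(num_writes, buffer_depth):
    """Processor stall cycles for a burst of consecutive writes."""
    if buffer_depth == 0:
        return num_writes * MEM_WRITE     # processor waits every time
    # only writes that overflow the buffer must wait for it to drain
    overflow = max(0, num_writes - buffer_depth)
    return overflow * MEM_WRITE

print(stall_cycles(4, 0), stall_cycles(4, 4), stall_cycles(6, 4))
# → 40 0 20: a 4-deep buffer hides a 4-write burst completely
```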

Figure 19: Write Buffer between Main Memory and Processor [20]

Figure 20: Write Buffer between L2 Cache and Processor [20]