
Advanced Cache Architecture for Embedded Systems:
Cache Coherence and Consistency

Patel, Jay; Dalvi, Siddesh; Saldanha, Brandon; Vaidya, Shail
BITS Pilani, K. K. Birla Goa Campus, India

Abstract – In multi-core systems, cache coherence plays an important role in ensuring the uniformity of shared data, copies of which are stored in multiple local caches. Processor performance depends significantly on cache coherence protocols, and they play a very important role in both centralized- and distributed-memory multiprocessors. Updating processor data and broadcasting changes in valid data to other cores and to main memory, so that invalid values are never loaded, are important steps in any cache coherence protocol. Several contributions have been made to cache coherence protocols, such as using invalidation-based protocols with a write-through cache. This paper gives an overview of these different protocols.

Keywords – Cache coherence; cache coherence protocols; MSI; MESI; invalidation-based protocol; update-based protocol; Dragon protocol; Firefly protocol

I. INTRODUCTION

In a shared-memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of an operand: one copy might reside in main memory while other copies reside in the cache memories. When one copy of an operand changes, the other copies should also be updated. Cache coherence thus ensures that changes to all operands are propagated throughout the system in a timely fashion [1]. Cache coherence is the consistency of shared data stored in multiple local caches [2].

To overcome this problem, cache coherence protocols are implemented. These protocols help to enforce the desired memory model and make sure that the latest value of an operand is returned whenever a memory location is accessed, ensuring data consistency throughout the system. Cache coherence can be achieved via hardware as well as software solutions. The main idea is that updates made by one processor should be communicated to all the others, so that all processors have a consistent view of memory.

II. BACKGROUND

Multiprocessors with shared buses adopt snoopy-based protocols, of which there are two types: invalidation-based protocols and update-based protocols. In invalidation-based protocols, each block can be in one of several states. The five main states are Modified (M), Exclusive (E), Invalid (I), Owned (O), and Shared (S). The S state allows several processors in a multiprocessor system to share the same block of data, while in the E state only one processor can have the block. In the M or E state a processor has a modified data block: the M state implies that the updated data block exists only in the cache of one processor, while in the O state the data is shared with other processors. Depending on the complexity and the requirements, different states can be combined to establish a cache coherence protocol [3]. Examples of invalidation-based protocols are MSI, MESI, MESIF, MOSI, MOESI, etc.

In update-based protocols, data is broadcast each time a processor updates a shared memory block, so all processors always have up-to-date data. Update-based protocols extend the states of the invalidation-based protocols with two additional states, Shared Clean and Shared Modified [3]. The Shared Modified state indicates that the main memory is not coherent.

The following criteria must be met for cache coherence to occur:

1) When processor X executes read and write operations on a location N, and no other processor operates on location N, the final value written to location N must come from processor X. Fig. 1 shows the operation of processor X on location N.

2) If processor X executes a read operation on a location and processor Y then requests a write operation on the same location, the final value must be the value modified by processor Y. This is referred to as cache coherency: for any update to a specific location, only the latest version of the modified value may be read by any processor.

3) When two processors execute write operations on the same location N, the final value must be the last value written to location N.

III. CACHE COHERENCE PROTOCOLS

This section presents a study of cache coherence protocols.

Archibald et al. presented six different cache coherence protocols, using multiprocessor simulation models as the performance metric. The simulation was implemented in Simula on a basic multiprocessor model. These protocols can be grouped into two classes: update-based protocols and invalidation-based protocols. In invalidation-based protocols, all other cached copies of the data are invalidated after a cache is written to; in update-based protocols, on the other hand, the write is broadcast to the other cached copies. The invalidation-based protocols studied in Archibald et al.'s work are Synapse, Goodman's Write-Once, Illinois, and Berkeley-Ownership; the two update-based protocols are Dragon and Firefly. In this paper, we describe and compare the MESI, MESIF, SCI, Word-Invalidate, and Firefly protocols [3].
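As a toy illustration of the invalidation-based snooping behaviour described above, the following Python sketch (our own simplification; the class and method names are hypothetical, and real protocols track per-line states such as MSI/MESI in hardware) models caches that watch a shared bus and drop their copy when another cache writes:

```python
# Toy model of invalidation-based snooping (illustrative only; real
# protocols track cache-line states in hardware, not whole values).

class SnoopyCache:
    def __init__(self, name, bus):
        self.name = name
        self.data = {}          # address -> value (valid copies only)
        self.bus = bus
        bus.attach(self)

    def read(self, addr, memory):
        if addr not in self.data:           # read miss: fetch from memory
            self.data[addr] = memory[addr]
        return self.data[addr]

    def write(self, addr, value, memory):
        self.data[addr] = value
        memory[addr] = value                # write-through for simplicity
        self.bus.broadcast_invalidate(self, addr)

    def snoop_invalidate(self, addr):
        self.data.pop(addr, None)           # drop the stale copy

class Bus:
    def __init__(self):
        self.caches = []
    def attach(self, cache):
        self.caches.append(cache)
    def broadcast_invalidate(self, writer, addr):
        for c in self.caches:
            if c is not writer:
                c.snoop_invalidate(addr)

memory = {0x10: 1}
bus = Bus()
a, b = SnoopyCache("A", bus), SnoopyCache("B", bus)
b.read(0x10, memory)         # B caches the block
a.write(0x10, 2, memory)     # A's write invalidates B's copy
print(b.read(0x10, memory))  # B misses and reloads the new value: 2
```

A write-through policy is used here purely to keep the sketch short; an update-based protocol would instead push the new value to the sharers rather than invalidating them.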

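The two write policies described in this paper, write-through and write-back, can be contrasted with a minimal sketch (our own illustration; the classes are hypothetical):

```python
# Write-through updates memory on every store; write-back defers the
# memory update until the line is flushed or evicted.

class WriteThroughCache:
    def __init__(self, memory):
        self.memory, self.data = memory, {}
    def write(self, addr, value):
        self.data[addr] = value
        self.memory[addr] = value        # memory updated immediately

class WriteBackCache:
    def __init__(self, memory):
        self.memory, self.data, self.dirty = memory, {}, set()
    def write(self, addr, value):
        self.data[addr] = value
        self.dirty.add(addr)             # memory is now stale
    def flush(self, addr):
        if addr in self.dirty:           # write back on flush/eviction
            self.memory[addr] = self.data[addr]
            self.dirty.discard(addr)

mem1, mem2 = {0x0: 0}, {0x0: 0}
wt, wb = WriteThroughCache(mem1), WriteBackCache(mem2)
wt.write(0x0, 1)
wb.write(0x0, 1)
print(mem1[0x0], mem2[0x0])  # 1 0: write-back memory is stale until flushed
wb.flush(0x0)
print(mem2[0x0])             # 1
```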
Multi-level caches can also result in cache incoherency and may require complex cache coherence protocols. There are two ways to deal with writes to a cache:

1) Write-through: all data written to the cache are instantaneously updated in the main memory as well [2].

2) Write-back: write operations are mostly carried out in the cache, and the main memory is modified only when the corresponding cache line is flushed [2].

The most widely used cache coherence protocols are:

1. MI (Modified (M), Invalid (I))
2. MSI (Modified (M), Shared (S), Invalid (I))
3. MESI (Modified (M), Exclusive (E), Shared (S), Invalid (I))
4. MOSI (Modified (M), Owned (O), Shared (S), Invalid (I))
5. MOESI (Modified (M), Owned (O), Exclusive (E), Shared (S), Invalid (I))
6. MESIF (Modified (M), Exclusive (E), Shared (S), Invalid (I), Forward (F))
7. MESRSI (Modified (M), Exclusive (E), Shared (S), Recent or Read Only (R), Invalid (I))
8. Write-Once, and the Synapse, Berkeley, Firefly, and Dragon protocols [3]

A. MESI Protocol

The MESI protocol is based on the four states that a block in the cache memory can have, which give the protocol its name: Modified, Exclusive, Shared, and Invalid. These states are explained below.

Invalid: a non-valid state. The data you are looking for is not in the cache, or the local copy of the data is not correct because another processor has updated the corresponding memory position.

Shared: shared without having been modified. Another processor may also hold the data in its cache memory, and both copies are in their current version.

Exclusive: exclusive without having been modified. That is, this cache is the only one that has the correct value of the block, and the data block matches the one in main memory.

Modified: in effect, an exclusive-modified state. It means that this cache holds the only correct copy in the whole system, and the data in main memory is stale [4].

Initially, when the cache is empty and a block of memory is written into the cache by the processor, the block gets the Exclusive state, because there are no other copies of the block. If the block is then written, its state changes to Modified, and the block in main memory now differs from it. On the other hand, if a block is in the Exclusive state and another CPU tries to read it but does not find it in its own cache, it fetches the block from main memory and loads it into its cache; the block is then in two different caches, so its state becomes Shared. If a CPU then has to write to a block that is in the Modified state in another cache, that block first has to be cleared from the cache it was present in and its latest value written back to main memory, because it was the most current version of the block in the system. When the CPU now reads it, the block is not found in its cache and is loaded from main memory, which holds the last updated values. This way we make sure that a processor only ever uses valid data: we do not have to worry that another processor has changed the data in main memory and holds the most current value in its cache. With the MESI protocol, the processor obtains the most current value every time it is required [5].

B. SCI Protocol

Shared-memory multiprocessors are commonly deemed easier to program than distributed multiprocessors, where communication takes place via message passing. However, the latter are easier to implement in hardware. A solution to this problem is a distributed shared-memory multiprocessor, which provides shared memory at the software level while the actual hardware implementation is a distributed message-passing system. The IEEE Standard for Scalable Coherent Interface (SCI) includes a protocol for maintaining cache coherence among the distributed components of such a distributed shared-memory multiprocessor.

An SCI node may contain a processor (consisting of multiple execution units and a cache) and may contain a memory. SCI nodes communicate via transactions, each consisting of a request packet and a response packet; in this simplified description, echo packets are not taken into account. A distributed shared-memory multiprocessor can be assembled out of these nodes [6].

In a cache-coherent SCI system, where snooping is not possible, a list of all caches that hold a copy of each memory line has to be maintained. In an SCI system, this "sharing list" is distributed among the system components, as illustrated in Fig. 2. The left-hand side of the figure shows a sharing list of cache lines in processors B and C and the corresponding memory line. The pointers for the sharing lists are stored in additional bits (tags) in each memory and cache line; the current states of the memory and cache lines are also stored in these tags.

We now give an example of a typical execution sequence in the cache coherence protocol. If processor A on the left-hand side of the figure is executing a Load instruction and wants to read data from the memory line that is shared by processors B and C, it first issues an mread64 request packet to the memory and is notified in the response packet that processor B has the data. Assume the data in processor B's cache line is modified. Then, processor A sends a cread64 request packet to processor B's cache, obtains the data in the response packet, and becomes the new head (owner) of the sharing list, as shown on the right-hand side of the figure. In the typical-set protocol, five instructions are defined by which a processor may access the shared memory: in addition to executing a Load or Store instruction, a processor may Delete itself from a sharing list, Flush (i.e., purge) the whole sharing list, or Lock the memory line. According to the standard, these instructions are executed in four phases, namely allocate, setup, execute, and cleanup. The distinct behaviors of processors, caches, and memories are defined separately from each other in the C code. According to this definition, the routines implementing cache and memory behavior execute atomically; however, the execution of a routine modeling the execution of an instruction by a processor may be non-atomic. For example, after processor A in the above example has sent out its mread64 request packet to the memory, processor B may start a Delete instruction, processor C may continue its Lock instruction in progress, etc. [7]

C. MESIF Protocol

The MESIF protocol scales by introducing a hierarchical model that implements the 2-hop protocol at multiple levels. Despite the fact that some operations may require multiple hops over different levels of the hierarchy, this latency can largely be overcome by the use of caches to maintain critical coherence information at critical points in the network, meaning that most common memory operations can still be performed with only two hops. Thus MESIF can support higher performance than a snooping protocol for a small number of nodes, but is capable of scaling to a large number of nodes, like a directory-based scheme. The protocol is designed to exploit the fact that shared data can be retrieved from another cache faster than it can be fetched from memory: shared data, as well as modified data, is supplied out of a cache rather than fetched from memory whenever possible [8].

In this protocol, a cluster consists of a collection of nodes. Each node may have either a cache controller (with attached processor), or main memory, or both. For every cache line address, one node (called Home) contains the main memory for that address. These two functions (memory and cache) are independent but co-exist in the node. In the hierarchical model, an entire cluster of nodes consisting of processors, caches, and memory is treated as a single node that consists of the same two entities. We use the term "Home" to refer to the memory and its supporting controller.

In general, messages are directed to one entity or the other. For the messages in a typical transaction, specifically, the first (broadcast) message is sent to the node (cache controller), not Home, while the third is directed to Home, not the node. Home does not respond to the broadcast message, though it may use it to initiate a speculative fetch. The protocol does imply that Home sees the initial request: the later message is sent to Home, either confirming or canceling the original request. If the Home node has no caching capability, the protocol can be optimized slightly: Home may be sent the initial broadcast message (or not), but it does not respond until requested in a subsequent phase. [8]
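The two-hop, cache-to-cache supply that MESIF is built around can be sketched as a toy model (our own simplification with made-up latency numbers; the real protocol involves the full multi-phase transactions and conflict handling described in this section):

```python
# Toy sketch of MESIF's "first-among-equals" Forward state: among several
# sharers, exactly one (in state F) answers a read request with data,
# giving a two-hop cache-to-cache transfer instead of a memory fetch.
# This is our own simplification, not the full MESIF specification;
# the latency figures are hypothetical.

MEMORY_LATENCY, CACHE_HOP = 100, 25      # assumed cycle costs

def read_request(addr, nodes, memory):
    """Requester broadcasts; the F/M/E holder forwards its copy."""
    for node in nodes:                    # each node is {addr: (state, value)}
        state, value = node.get(addr, ("I", None))
        if state in ("F", "M", "E"):      # designated responder supplies data
            node[addr] = ("S", value)     # responder drops to Shared
            return value, 2 * CACHE_HOP   # request hop + data hop
    return memory[addr], 2 * MEMORY_LATENCY   # fall back to Home/memory

memory = {0x40: 7}
# Three sharers: only B holds the line in state F.
a, b, c = {}, {0x40: ("F", 7)}, {0x40: ("S", 7)}
value, latency = read_request(0x40, [a, b, c], memory)
print(value, latency)   # 7 supplied from B's cache in 50 "cycles", not 200
```

Note how only the F-state node responds with data even though two nodes share the line; this is exactly the ambiguity the Forward state removes.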
A cache line may be in one of five states in each of the nodes: M (Modified), E (Exclusive), S (Shared), I (Invalid), or F (Forwarding). The first four states correspond to the classic states of a MESI protocol. The Forwarding state indicates a "first-among-equals," that is, a state assigned to a single node among those sharing a cache line, which is responsible for providing the line when a request is received. This state is similar to an "Owned" state, but coherent with memory and useful for read requests, not just write requests. The Forwarding state in MESIF primarily facilitates the rapid response of a cached copy (the requestor broadcasts and the responder provides a cached copy, for a 2-hop latency) in the presence of multiple cached copies. The F state also simplifies the conflict resolution mechanism in the protocol.

A transaction typically consists of five sets of messages, which may be partially overlapped. In many cases, one of the first response messages returns data, so the requesting node can quickly acquire and (provisionally, depending on the memory model) use the data before the transaction is completed. For all memory models, the data can be used after the fourth message.

A transaction is initiated by a broadcast to all nodes, including the Home node. All nodes must respond. A response may include data (from at most one node), and indicates:

1) The state of the cache line in the responding cache, and
2) An indication of whether a conflicting request has been detected.

One of the following responses will be returned from each node except Home:

1) IACK: invalid state acknowledgement,
2) SACK: shared state acknowledgement (no data sent),
3) Data and State: data sent along with the state (F, E, or M) to transition to, and
4) Conflict: there is already a pending request for the same cache line.

In order to guarantee the serializability of transactions, a node supplying data may not respond to any further requests for the same cache line until it has received an acknowledgement from the requesting node. The requesting node may use the data, but it is not allowed to acknowledge receipt from the sender until it receives an acknowledgement from Home.

After responses have been received from all nodes, a second-phase message, either a READ or a CNCL, is sent to Home: a READ requests data, while a CNCL indicates that the requestor has already received the data. In either case, the message must also enumerate all conflicting requests detected for this cache line between the time the request was initiated and the time the last response was received. If a conflict has been detected, the node signals whether it was the winner (CNCL) or a loser (READ). The Home is permitted to respond immediately upon receiving a READ or CNCL message from the requestor; it must not wait for all conflicting requests to arrive, because there are cases where some cannot be generated until it has responded. Home must respond to a READ request by sending either the data from memory or an instruction to another node to forward the data (in conflict cases). In the case of a conflict, Home must identify all parties to the conflict and provide for the serial transfer of the cache line to each of the requesting nodes. It does this by sending forwarding messages to the winner and all but one of the losing nodes, also informing each losing node to await data from a specified node. If there is no conflict, Home responds to a CNCL by acknowledging (ACK) receipt of the message, thus terminating the transaction.

A transaction may be initiated by any of the following requests:

1) Read Shared (RS)
2) Read for Ownership (RFO)
3) Write back (dirty eviction)
4) Invalidation request (upgrade to ownership)

D. Firefly Protocol

Firefly is a multi-chip cache coherence solution. A small, direct-mapped cache is not usually consistent with high performance; the Firefly is unusual in that the primary purpose of its cache is not to reduce the average access time to main memory. Instead, its purpose is to reduce the per-processor load on the main storage system. This reduction in bus loading makes it possible for a relatively low-performance bus to support a number of processors. Another unusual aspect of the Firefly is that it provides a global shared memory in which data written by one processor are immediately available to the other processors in the system. The need for coherent main memory makes it impossible to use a standard memory bus or standard storage modules, but this factor was outweighed by the simplification of the system software that coherent memory provides.

The most important feature of the Firefly caches is that they provide this global shared memory using coherent, or snoopy, caches, which monitor the memory bus traffic and take or supply data as necessary to maintain coherence. The Firefly cache takes advantage of the temporal locality of programs, but since its line size is limited, it cannot take advantage of spatial locality [9].
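The Firefly-style write-update behaviour, in which a write is seen and absorbed by every snooping cache rather than invalidating the other copies, can be sketched as follows (our own simplification, not the actual Firefly hardware):

```python
# Toy write-update snooping in the spirit of Firefly: a write goes out on
# the bus and every cache holding the block takes the new value, so data
# written by one processor is immediately visible to the others.
# (Our own simplification; the real Firefly also keeps main memory
# coherent on shared writes, which we mimic with a direct update.)

class UpdateCache:
    def __init__(self):
        self.data = {}
    def snoop_update(self, addr, value):
        if addr in self.data:        # only current sharers take the update
            self.data[addr] = value

def bus_write(writer, caches, memory, addr, value):
    writer.data[addr] = value
    memory[addr] = value             # shared writes also update memory
    for c in caches:
        if c is not writer:
            c.snoop_update(addr, value)

memory = {0x8: 5}
p0, p1 = UpdateCache(), UpdateCache()
p0.data[0x8] = 5
p1.data[0x8] = 5
bus_write(p0, [p0, p1], memory, 0x8, 9)
print(p1.data[0x8])   # 9: p1 sees the new value without taking a miss
```

Contrast this with the invalidation sketch earlier in the paper: here the sharer keeps a valid, updated copy instead of suffering a coherence miss on its next read.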
E. Word-Invalidate Protocol

According to the above, low- to medium-scale shared-memory, bus-based multiprocessors which execute programs with sequential sharing are a very attractive solution for commercial products.

Because of spatial locality, data is fetched into the cache in blocks, in order to reduce the average access latency. Most previous write-invalidate schemes also use the data block as the coherency unit. When a word is to be updated in some cache, the entire block in all other caches holding a copy is invalidated, although only one word becomes stale. This invalidation is simple, but not discriminative enough, since the still-valid rest of the block is unnecessarily wasted. On subsequent read requests from other processors to those words of the same block, logically unnecessary invalidation misses are encountered, and the hit ratio decreases. Hence, invalidation misses are recognized as a main obstacle to attaining better performance in write-invalidate protocols.

The WIP (Word-Invalidate Protocol) is based on the principle of partial invalidation, which upgrades write-invalidate protocols with a word-invalidation capability instead of the usual full-block invalidation. Two bits are associated with each word in a block: a V (valid) bit and an LW (last write) bit. When some processor updates a word in its cache, it sends a request to the other cached copies of the same block to invalidate only that particular word (by resetting its V bit). Also, the LW bit of the updated word in the writer's cache is set, so subsequent writes by the same processor to the same word, without intervening accesses to that word by other processors, can be performed locally. This "fine-grain" invalidation approach is expected to achieve better utilization of data already fetched into the cache, and consequently a higher hit ratio and a lower coherence overhead.

The protocol permits a "pollution" of valid blocks with a certain number of invalid words (partially valid blocks). However, if an invalidated word is needed again in the future, only that word has to be read instead of the entire block; after that cheaper "short read miss," the block is partially (or even fully) recovered. Reasoning that many invalid words indicate that a cached block is not actively used, the WIP protocol restricts their number within a partially valid block (the invalidation threshold). Once the invalidation threshold is exceeded, any subsequent invalidation of a valid word in the block invalidates the entire block, increasing the probability of making the most actively used cached copy exclusive to its owner. Adaptively combining partial and full invalidation is expected to bring certain benefits: partial invalidation promises to be more appropriate when the active sharing of blocks is higher, while full invalidation is better when active sharing is lower. [10]

The WIP protocol is completely specified through the description of its block states and state transitions. An invalidation threshold of two invalid words is assumed; in that case, the protocol uses seven states to define the status of a block:

INV - Invalid, or not present
IW1 - Partially valid block, with one invalid word
IW2 - Partially valid block, with two invalid words
UNMOD-EXC - Unmodified-exclusive
UNMOD-SHD - Unmodified-shared
MOD-SHD - Modified-shared
MOD-EXC - Modified-exclusive

The states MOD-EXC and MOD-SHD, which result from a write access to a block, assume ownership of the block. The owner of the block is responsible for supplying the block on a request. As in some other protocols, a special bus "sharing line" (active low) is used for dynamic determination of the sharing status. The operation of the WIP protocol is specified for all situations that can occur, and the protocol mechanism is also represented with state-transition diagrams for processor-induced actions.

IV. FUTURE SCOPE

Verification of cache protocols is somewhat more complex in the presence of latency-tolerance hardware. To verify cache protocols in such systems, the state expansion process which searches for all reachable system states must take into account synchronization accesses as well as regular data accesses.

Several problems, such as the lack of a good methodology for dealing with linked-list directory-based protocols, still need to be solved. This may be addressed by separating the verification of correct maintenance of the sharing lists from the verification of coherence. Another problem is the verification of correct ordering of all memory accesses. Memory access ordering is often seen as an issue orthogonal to cache coherence, since it arises whether or not there are caches.
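The state-expansion approach mentioned above can be illustrated with a miniature example (entirely our own toy model, not a real protocol verifier such as Murphi): exhaustively enumerate the reachable states of a two-cache system under simple MSI-like transition rules, and check a coherence invariant in every reachable state.

```python
# Toy "state expansion": enumerate all reachable states of a two-cache
# system under simplified MSI-like rules and check that no state ever
# holds two Modified copies of the same line. The event names and rules
# are our own simplification for illustration.

EVENTS = ["read0", "read1", "write0", "write1"]

def step(state, event):
    s = list(state)                  # (cache0_state, cache1_state)
    who = int(event[-1])
    other = 1 - who
    if event.startswith("read"):
        if s[who] == "I":
            s[who] = "S"
            if s[other] == "M":      # a read forces the writer back to Shared
                s[other] = "S"
    else:                            # write
        s[who] = "M"
        s[other] = "I"               # invalidate the other copy
    return tuple(s)

def reachable(initial=("I", "I")):
    seen, frontier = {initial}, [initial]
    while frontier:                  # exhaustive search of the state graph
        state = frontier.pop()
        for e in EVENTS:
            nxt = step(state, e)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

states = reachable()
# Coherence invariant: never two Modified copies of the same line.
assert all(s.count("M") <= 1 for s in states)
print(sorted(states))
```

Real verifiers face the same idea at vastly larger scale, which is why the state-explosion caused by synchronization accesses and latency-tolerance hardware is the central difficulty noted above.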
Memory consistency models are meaningful only in the presence of caches. The memory model tells us when stores must propagate, and the cache consistency protocol must propagate them in time for a subsequent read by a remote processor. The methods based on state expansion covered in this survey are effective at proving that the cache consistency hardware correctly fulfills this function. Finding a framework to deal with the memory consistency model in the context of formal verification based on state expansion would certainly be a breakthrough in the field of formal architecture verification. [11]

REFERENCES

[1] M. E. Thomadakis, "The Architecture of the Nehalem Processor," p. 48, 2011.
[2] S. Mittal and Nitin, "A New Approach to Directory Based Solution for Cache Coherence Problem," p. 5, 2015.
[3] Z. Al-Waisi and M. Opoku Agyeman, "An Overview of On-Chip Cache Coherence Protocols," p. 1, 2017.
[4] F. J. Jiménez Maturana, C. Peinado Gómez, E. Herruzo Gómez, and J. I. Benavides Benítez, "Teaching the cache memory coherence with the MESI protocol simulator," 2006.
[5] P. Stenstrom, "A Survey of Cache Coherence Schemes for Multiprocessors," pp. 12-24, 1990.
[6] U. Stern and D. L. Dill, "Automatic Verification of the SCI Cache Coherence Protocol," 2004.
[7] U. Stern and D. L. Dill, "Automatic Verification of the Cache Coherence Protocol."
[8] H. Cheong and A. V. Veidenbaum, "A Cache Coherence Scheme With Fast Selective Invalidation," 2007.
[9] C. P. Thacker, L. C. Stewart, and E. H. Satterthwaite, "Firefly: A Multiprocessor Workstation," 2009.
[10] M. Tomašević and V. Milutinović, "The word-invalidate cache coherence protocol," 1995.
[11] F. Pong and M. Dubois, "A Survey of Verification Techniques for Cache Coherence Protocols," 1996.
