
To appear in the proceedings of the Workshop on Complexity-Effective Design held in conjunction with the 27th annual International Symposium on Computer Architecture (ISCA-2000), Vancouver, British Columbia, Canada

Toward Complexity-Effective Verification:


A Case Study of the Cray SV2 Cache Coherence Protocol

Dennis Abts (dabts@cray.com)
Cray Inc.
Chippewa Falls, Wisconsin 54729

David J. Lilja (lilja@ece.umn.edu)
University of Minnesota
Electrical and Computer Engineering
Minnesota Supercomputing Institute
Minneapolis, Minnesota 55455

Steve Scott (sscott@cray.com)
Cray Inc.
Chippewa Falls, Wisconsin 54729

Abstract

Modern large-scale multiprocessors, capable of scaling to hundreds or thousands of processors, have proven to be very difficult to design and verify in a timely manner. In particular, the verification process, i.e., proving that the design is functionally correct, is often the most time-consuming aspect of developing the system. This paper discusses the methodology and early experiences of verifying the Cray SV2 cache coherence protocol. It proposes a method of dealing with the verification complexity of a directory-based coherence protocol. We provide the framework for a methodology that is built on a formal model of the coherence protocol, a language, and the RTL implementation. Finally, we show how this approach was used to verify the SV2 directory-based coherence protocol at the architectural level and at the corresponding Verilog implementation level.

1 Introduction

Shared-memory multiprocessors provide both scalability and a flexible programming model. These features, however, come at the expense of additional hardware complexity in the coherent memory subsystem. The Distributed Shared Memory (DSM) architecture [1, 2, 3] provides a logically shared address space, although the physical memory is distributed among the computing nodes. This organization creates an extended memory hierarchy that spans from the load-store unit of a given processor through multiple levels of cache, and possibly across multiple nodes that communicate over an interconnection network. The memory hierarchy is kept consistent by several cooperating finite-state machines (FSMs) controlling each level of the memory hierarchy (Figure 1). These FSMs interact via low-level micropackets. This message flow is fundamentally responsible for the implicit transferral of data through the extended memory hierarchy that results from accessing the shared address space. The shared-memory programming model can be viewed as an abstraction layered upon this message passing substrate.

[Figure 1: The local memory hierarchy at each node in a DSM machine has several finite state machines (FSMs) that communicate by exchanging micropackets. The diagram shows processors P1 through Pn, each with an L1 cache, an L2 cache, and a local memory with directory, an FSM controlling each level, and routers connecting every node to a scalable interconnection network.]

The FSMs at each level of the memory hierarchy are responsible for implementing the coherence protocol. The coherence protocol is a set of rules that, when followed by each computing node, ensures a consistent view of the memory system [4, 5, 6, 7]. The coherence mechanism may be implemented in a variety of ways, through either software, hardware, or a combination of the two. A hardware-based coherence mechanism provides better performance at the expense of losing the flexibility that is afforded by software-based approaches. Bus-based systems (such as the SGI Challenge and Sun Enterprise) use a bus snooping approach. However, even at modest processor counts, contention for a shared system bus can degrade the memory system performance significantly. A better approach for large-scale systems is to use a directory-based coherence protocol in combination with a scalable interconnection network. The directory is a data structure that stores the current state (i.e., access permission) of a memory block, as well as a list of nodes with access to the block.

1.1 Motivation

Designing an efficient coherence protocol is very challenging and, as Lenoski and Weber [1] point out, "unfortunately, the verification of a highly parallel coherence protocol is even more challenging than its specification." The complexity of verifying these systems goes far beyond that of a typical VLSI circuit. The subtle, yet complex interactions between the FSMs that implement the coherence protocol at each node make verification very difficult, time-consuming, and error prone.

To mitigate this complexity, hierarchical design methods are used, with verification occurring at various levels of abstraction (Figure 2). This allows architectural verification to be performed early in the design process to verify that a proposed architectural feature of the system is sound. At this abstract level, a variety of methods, ranging from formal methods to behavioral simulation, can be employed. The confidence that the architectural verification provides is limited by how closely the abstracted model reflects the actual implementation and by the assumptions made during the verification process. The implementation verification uses traditional logic simulation techniques to verify the register transfer language (RTL) specification of the design.

Unfortunately, the verification process does not have the nice step-wise refinement property that the design process has. That is, there is no cohesive method of binding together the architectural and implementation verification efforts. In this paper, we propose a methodology that is built on a formal model of a coherence protocol and show how this approach was used to verify the SV2 directory-based coherence protocol at the architectural level and at the corresponding Verilog implementation level.

[Figure 2: Each stage of the design process can be viewed as a step-wise refinement of the prior stage, and verification depicted as a comparison of adjacent stages of the design process. The stages are Functional Specification, Design Specification, Implementation (RTL), Structural (gates), and Physical Design; architectural verification compares the first pair, implementation verification the next, and equivalence checking relates the RTL and structural levels.]

2 Background and Related Work

2.1 Overview of the Cray SV2 Memory System

The Cray SV2 is a scalable distributed shared memory (DSM) vector machine targeted at high-end scientific and technical applications. A single SV2 node consists of four unique chip types: a scalar/vector processor (P), an external L2 cache (E), a memory directory (M), and an input/output (I) chip. Multiple nodes are interconnected using router (R) chips. The P chip contains an instruction cache and the L1 data cache. The instruction cache coherence is maintained by software with explicit flushing when the icache contents are potentially stale, such as after loading a program. The SV2 cache coherence protocol spans the P, E, and M chips, where the L1 data cache is in the P chip, the L2 cache is in the E chip, and the memory directory is in the M chip.

The basic processing element in the SV2 is a multistream processor (MSP). The memory directory uses a simple bit-vector directory pointer to track the sharing set of a memory block [4]. The memory directory tracks L2 caches only, not L1 caches. Since the L2 cache is inclusive of the L1 cache, we need only track L2 caches at the directory; each L2 cache then maintains a 4-bit inclusion vector with each cache line to indicate which L1 caches (P chips) are sharing the line.

2.2 Related prior work

Collier [8, 9] developed a suite of high-level tests called ARCHTEST to reason about the correctness of a multiprocessor memory system. Although the test programs themselves are relatively simple, the post-processing of the results is complex. The ARCHTEST suite can be compiled and executed on an existing multiprocessor. Unfortunately, it provides little assistance in the development of new hardware. Furthermore, ARCHTEST is unable to detect deadlock, live-lock, and loss of data coherence.

Dubois et al. constructed a prototype [10, 11] of the Stanford DASH [12] multiprocessor using reconfigurable logic devices (FPGAs). The goal of the project was to provide a verification vehicle as well as a platform for performance evaluation. They implicitly verified aspects of the DASH multiprocessor by running the SPLASH parallel benchmarks. Unfortunately, this approach is very time-consuming, expensive, and cannot be performed until late in the design process. Furthermore, finding the design flaws is very difficult since only limited visibility and primitive tools are available for probing into the logic design.

Lamport logical clocks have been extended to study the correctness of memory consistency models [13, 14]. This time-stamping technique provides a tool for reasoning about the proper event orderings in a multiprocessor memory system. However, it is unclear how this approach can be used to

detect deadlock, live-lock, and fairness. Moreover, since this analysis must be integrated into an architectural simulation tool, it can be used to reason about the correctness of only a particular program execution. Obviously, it is impossible to conclude that, if the system is correct for a given program execution, it must be correct for all executions and all programs.

More recently, formal verification methods have been used to validate the coherence protocols of the SGI Origin2000 [15, 16, 17] and the Sun S3.mp (Sun Scalable Shared-Memory Multiprocessor) [18, 19]. Eiríksson used the Symbolic Model Verifier (SMV) [20] to verify an abstract model of a three-node Origin2000 system (two processors and an I/O unit). Pong et al. used the Murφ [21] formal verification system to verify an abstracted three-node S3.mp system. Each of these efforts needed to model the system at a high level of abstraction (ignoring as much detail as possible) to make the verification tractable. Despite the unrealistically small verification model, these methods proved to be very valuable in extracting very subtle design flaws that would have been difficult, if not impossible, to detect using traditional logic simulation. Nonetheless, in both cases extensive logic simulation was performed to verify the actual RTL implementation.

3 Formal Correctness

Showing that a cache coherence protocol is correct is nontrivial, as there are many aspects to "correctness," and the protocol state space is very large. Our approach is to formally model the protocol and prove that a collection of well-defined, fundamental properties holds over the state space. We expect that most coherence protocols would require similar properties. The properties make no assumptions about the detailed implementation of the protocol. Rather, we use generic predicates and functions to describe the state of the caches, directory, and interconnection network. (Predicates are designated by bold typeface and evaluate to a logical true or false; functions return a value and are in sans serif typeface.)

3.1 Data Coherence

A memory system is coherent if the value returned by a load is always the value from the "latest" store to the same memory [6, 22]. On the surface this notion of data coherence appears vague, so this point bears some elaboration. As a memory request propagates through the memory hierarchy, hardware components, such as arbiters and buffers, will impose a serial ordering on all the memory operations to the same address. This linearization of memory events provides context for the word "latest" in the definition of memory coherence.

We indirectly capture the notion of data coherence by making some assertions about the state of the memory directory and caches.

Property 1 If an address, a, is in the "noncached" state at the directory and there are no messages, m, in-flight in the interconnection network, then all caches must have address a in an invalid state.

    Dir(Home(a), a) = Noncached ∧ ¬∃m : InFlight(m, a)  ⟹  ∀c : State(c, a) = Invalid

where a is an address, c is a processor cache, and m is a message. The function Home(a) returns the identity of the memory directory responsible for managing the address a. Likewise, the function Dir returns the state of the memory directory (access permission and sharing set) for a given memory directory.

Property 2 If an address, a, is present in cache, c, then it must be included in the sharing set by the directory.

    ∀c : Present(c, a)  ⟹  SharingSet(Home(a), c, a)

The SharingSet predicate returns true if the memory directory knows that address, a, is present in cache, c. Another way to look at this is that the set of caches with address, a, present is a subset (⊆) of the sharing set at the directory.

While these two properties do not explicitly address the read-the-latest-write aspect of memory coherence, they do ensure that the memory directory is properly maintaining the sharing set, an essential ingredient for memory coherence. Property 2 says that a cache line may still be tracked by the memory directory, even if it is no longer present in the cache. For instance, if a cache line is evicted, there will be some transient time between the eviction notice being sent and the memory directory removing the cache from the sharing set. As such, the cache could receive a "phantom" invalidate from the directory for a cache line that is no longer present.

3.2 Forward Progress

Ensuring forward progress requires every memory request to eventually receive a matching response. Since all coherent memory transactions occur using request-response message pairs, we can exploit this fact by formally stating:

Property 3 Each request must have a satisfying response.

    ∀q : Request(q)  ⟹  ∃r : Response(r) ∧ Satisfies(r, q)

Moreover, the forward progress property (Property 3) encapsulates the notion of deadlock and live-lock avoidance by requiring each request to eventually receive a matching response. Deadlock is the undesirable condition making it impossible to transition out of the current global state, and live-lock is a cycle of states that prevents forward progress. The predicates Request(q) and Response(r) evaluate to a logical true if q is a request and r is a response, respectively. Similarly, the predicate Satisfies(r, q) evaluates to a logical true if r satisfies q. For example, the predicate Satisfies(r, q) would consult the transition relation for the coherence protocol to determine if r was an expected response to request q. Clearly, this property ensures forward progress by ensuring that a request is never starved or indefinitely postponed.

However, the forward progress property makes no claims about fairness. For instance, it says nothing about the distribution of service times or even that requests are serviced in an equitable manner; these are more implementation-specific properties that deal with the performance of the memory system and not its correctness.

3.3 Exclusivity

The coherence protocol enforces some access permissions over the shared memory to ensure that there are never two or more processors with "exclusive" (write) access to the same memory block. This single-writer property can be stated as:

Property 4 Two different caches, c and d, should never have write access to the same address, a, at the same time.

    ∀a : Dirty(c, a)  ⟹  ¬∃d ≠ c : Dirty(d, a)

This property ensures that no two processors c and d are able to have memory block a in their local memory hierarchy in the "dirty" state at the same time. (Some protocols use the "dirty" or the "exclusive" state; we assume the predicate will return true if the cache line is dirty or exclusive.)

3.4 Unexpected Messages

A coherence protocol is specified for each level of the memory hierarchy as a set of 4-tuples (s, i, s′, a), where the current state is s. When input i is received, the current state transitions to new state s′ and performs action a. Error conditions should be explicitly defined. Unexpected messages will occur only where the protocol is incorrectly or incompletely specified by the protocol designer. Figure 3 gives an example protocol specification for an L1 cache, where s is the state of the cache line, i is the input message, s′ is the new state resulting from this transition, and a is the action that the controller takes. A state, s, is considered transient if s ∉ Q and quiescent if s ∈ Q. In the example shown in Figure 3, the Pending state is transient and Invalid is quiescent.

Figure 3: An example coherence protocol specification for an L1 cache.

    S  = {Invalid, Exclusive, Shared, Pending}
    Q  = {Invalid, Exclusive, Shared}
    s0 = Invalid
    I  = {PrRead, PrWrite, Inval, ReadResp, GrantExcl}

    s (state)   i (input)   s' (next)   a (action)
    Invalid     PrRead      Pending     L2(L1ReadReq)
    Invalid     PrWrite     Pending     L2(GetExclusive)
    Invalid     Inval       Invalid     L2(InvalAck)
    Invalid     ReadResp    -           Error(UnexpectedMsg)
    Invalid     GrantExcl   -           Error(UnexpectedMsg)
    Exclusive   PrRead      Exclusive   P(ReadResp)
    Exclusive   PrWrite     Exclusive   P(WriteComplete)
    Exclusive   Inval       Invalid     L2(InvalAck)
    Exclusive   ReadResp    -           Error(UnexpectedMsg)
    Exclusive   GrantExcl   -           Error(UnexpectedMsg)
    Shared      PrRead      Shared      P(ReadResp)
    Shared      PrWrite     Pending     P(GetExclusive)
    Shared      Inval       Invalid     L2(InvalAck)
    Shared      ReadResp    -           Error(UnexpectedMsg)
    Shared      GrantExcl   -           Error(UnexpectedMsg)
    Pending     PrRead      Pending     Block(PrRead)
    Pending     PrWrite     Pending     Block(PrWrite)
    Pending     Inval       Pending     Block(Inval)
    Pending     ReadResp    Shared      P(ReadResp)
    Pending     GrantExcl   Exclusive   P(WriteComplete)

[Figure 4: Work-flow diagram describing the steps involved in constructing the protocol verifier: Coherence Protocol Specification (*.tbl), then the Protocol Compiler (tbl2m), producing a Murφ Specification (*.m), then the Murφ Compiler (mu), producing an Intermediate Specification (*.C), which is compiled and linked (g++) into the Protocol Verifier.]

4 Verification Methodology

4.1 Formal Verification Model

The input in the verification process is a formal specification of the cache coherence protocol, as shown in Figure 3, for each level of the memory hierarchy. The objective is

to show that the coherence protocol is architecturally sound by satisfying the correctness properties outlined in Section 3. Once we have established the correctness of the protocol at an abstract level, we then would like to show that its implementation is also correct.

[Figure 5: Block diagram of the formal verification model, which is described in the Murφ description language. The model is written to be scalable from 1 . . . N pseudonodes. Each pseudonode contains a processor, an L1 cache (tag, state, data), an L2 cache (tag, state, data, and sharers), a local memory with its directory, and a network interface; the pseudonodes are connected by a virtual network.]

To verify the coherence protocol at an abstract level, we used the Murφ formal verification environment [21]. The coherence protocol is specified as several human-readable text files. These files are then read by a protocol compiler to automatically generate the finite-state machine descriptions in the Murφ description language (Figure 4). The Murφ compiler is then used to create the intermediate C++ description, which is compiled and linked into the protocol verifier.

Constructing the formal verification model is a very time-consuming and difficult task. A detailed understanding of the design is necessary to be able to abstract away any unnecessary details. What details are "unnecessary?" Three types of abstraction are necessary to make the formal verification tractable: system scaling, data abstraction, and temporal abstraction. System scaling pretends that the system is much smaller than it really is, while still providing enough detail to capture the essence of the original design. Data abstraction seeks to re-encode the meaning of a variable to reduce the number of reachable states. Temporal abstraction is used to eliminate the notion of "timing" within the design; that is, we ignore the propagation delay of the actual hardware.

The formal verification model (Figure 5) is scalable from 1 . . . N "pseudonodes"; however, only two pseudonodes are required for the formal verification. Each pseudonode has only a single processor and a single L2 cache in each MSP, whereas the real machine has four P chips and four E chips per MSP. In effect, we are taking a "slice" of an MSP. Each cache line is only a single bit, with a single-bit tag. Intermediate hardware structures, such as output request buffers and transient buffers, are modeled where necessary for accuracy. The pseudonodes are connected by a virtual network with three virtual channels (vc0, vc1, and vc2). The virtual channel buffers are only one entry deep.

The coherence protocol is blocking (no "retry" nacks are used), so packets that cannot be processed can put back pressure on the virtual network. Deadlock is avoided by guaranteeing that packet dependency chains are acyclic. The longest dependency chain occurs when (1) the L2 cache makes a request to the directory, (2) the directory forwards the request to the owner, and (3) the owner cache sends a cache-line response back to the directory. Packets between a sender and receiver pair on a given virtual channel must be delivered in-order by the virtual network.

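The virtual-network behavior just described (three virtual channels with one-entry buffers, blocking back pressure instead of retry nacks, and in-order delivery per channel) can be sketched as follows. This is a minimal illustrative model in C++, not the Murφ model itself; all type and member names here are ours, not from the SV2 design.

```cpp
#include <array>
#include <cassert>
#include <optional>
#include <string>

// A micropacket traveling on one of the three virtual channels.
struct Packet {
    std::string type;  // e.g. "MRead", "ReadSharedRsp" (illustrative)
    int address;
    int vc;            // virtual channel: 0, 1, or 2
};

class VirtualNetwork {
    // One-entry buffer per virtual channel, as in the formal model.
    std::array<std::optional<Packet>, 3> buffer_;
public:
    // A blocking protocol: a send fails (back pressure) when the channel
    // buffer is occupied, and the sender must hold the packet and retry.
    bool send(const Packet& p) {
        assert(p.vc >= 0 && p.vc < 3);
        if (buffer_[p.vc].has_value()) return false;  // back pressure
        buffer_[p.vc] = p;
        return true;
    }
    // With a one-entry buffer, per-channel in-order delivery is immediate.
    std::optional<Packet> receive(int vc) {
        std::optional<Packet> p = buffer_[vc];
        buffer_[vc].reset();
        return p;
    }
};
```

A full model would layer the L2 and directory FSMs on top of this, asserting that every blocked sender is eventually unblocked, which is how acyclic dependency chains avoid deadlock.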
[Figure 6: Producing witness strings from a depth-first search of the state space, which are later played back on the actual hardware. (a) Using witness strings to refine the abstract formal verification, allowing verification of the implementation: the Protocol Verifier (sv2 cc) emits Witness Strings (*.ws) that drive the Logic Simulator (Gensim), which accepts or rejects them. (b) A witness string is one path of a depth-first search of the state space, from the start state S0 to a leaf.]

4.2 Verification of the Implementation Model

The formal verification model described above can be used to automatically verify the coherence protocol at the architectural level. However, to verify the implementation itself we must build a connection from this abstract, architectural model to the Verilog implementation model. In verifying the actual implementation of a cache coherence protocol, we are no longer dealing with an abstract representation of the design. Instead, the complexity and sheer enormity of a system comprised of several deep submicron ASICs place an overwhelming burden on the verification effort.

There are several ways to attack this verification problem. One possible approach is to construct a testbench around the actual hardware and write some pseudorandom and directed diagnostics to attempt to expose implementation flaws. This is a useful exercise and will undoubtedly uncover some implementation errors. However, enumerating all the possible event orderings and "interesting cases" is extremely difficult and time consuming.

We feel a better approach is to refine the abstracted formal verification model developed above to allow the results from the abstracted verification to be used to verify the implementation. The Murφ formal verification system used to verify the architectural soundness of the coherence protocol will enumerate all reachable states and will search the state space to establish the correctness of the properties specified, or it will show a counter-example. This search process can be conducted in three possible ways: 1) breadth-first, 2) depth-first, or 3) random. We discovered that if we made trivial modifications to the Murφ source code, we could observe the nondeterministic firing of rules as the finite-state machines at each pseudonode interact. We developed a method for recording these events during a depth-first search of the state space in Murφ, making it possible to reproduce these interactions on the actual hardware (Figure 6). We refer to the set of events from the start state, S0, to a leaf node as a witness string since it "witnesses" the execution of the formal verification model. In a formal sense, we can say that the set of all witness strings accepted by the formal model defines the language, L, accepted by the coherence protocol. If the language L is also accepted by the implementation verification, then we have a rigorous connection between the verified architectural model and its implementation.

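The language-acceptance view above can be sketched abstractly: a witness string is an ordered list of rule-firing events from S0 to a leaf, and replaying it asks whether another model accepts every event in order. The sketch below assumes nothing about Murφ internals; the event strings and the acceptance callback are illustrative stand-ins for the instrumented search and the RTL testbench.

```cpp
#include <functional>
#include <string>
#include <vector>

// One witness string: the ordered rule-firing events on a single
// depth-first path from the start state S0 to a leaf.
using WitnessString = std::vector<std::string>;

// Replays a witness string against any model that judges one event at a
// time. Returns true iff the model accepts every event in order, i.e. the
// string is in that model's language L. A false return pinpoints a
// counter-example: the first event the implementation rejects.
bool replay(const WitnessString& ws,
            const std::function<bool(const std::string&)>& accept_event) {
    for (const std::string& event : ws)
        if (!accept_event(event))
            return false;
    return true;
}
```

In the real flow the callback would drive the Verilog simulation of the E chip and compare its outputs against the event; here it is any predicate over events.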
[Figure 7: An example of a protocol error that was discovered using Murφ; had this error gone undetected it would have resulted in a loss of memory coherence. The diagram shows Pseudo-Node 1 and Pseudo-Node 2, each with a processor (Proc0), an E chip cache (E1, E2), and a memory directory (M1, M2). Square markers indicate when a packet was sent and circular markers when it was received. The traced packets include Read(X), Read(Y), PReadResp(X), PInvalidate(X), PInvalidate(Y), MRead(X), MRead(Y), MDrop(X), MDrop(Y), and ReadSharedRsp responses. The E1 cache state moves through Y : ShClean, (1) X : PendingReq, (3) X : ShClean, (4) Y : PendingReq, (7) Y : ShClean, while the directory state for Y moves from Shared : E1 to (8) Noncached.]

5 Results

In this section we discuss some preliminary results of the SV2 cache coherence verification effort. While the formal verification of the abstracted model is complete, the verification of the implementation is still ongoing.

5.1 Architectural Verification

The objective of architectural verification is to show that the cache coherence protocol is sound; that is, that it satisfies the properties outlined in Section 3. The SV2 formal verification model consists of two pseudonodes and the virtual network. This model requires 1536 bits of state information. Once the formal verification model was constructed, we began to verify the protocol in a step-wise manner. First, we allowed processors to issue only scalar Read requests. Then, we allowed processors to issue scalar Read and SWrite operations. This incremental approach kept debugging relatively simple. However, as we allowed the processors to issue more requests, the reachable state space grew exponentially.

When an error is reported by the protocol verifier, we must determine if the error is a problem with the formal verification model or if it is a genuine protocol design error. Most of the early errors were a result of modeling errors in the virtual network. However, several protocol design errors were found, one of which would have resulted in a loss of memory coherence. Figure 7 shows the protocol error. The arrows in Figure 7 show the packets exchanged between chips, with a square marker indicating the time the packet was sent from the chip and a circular marker indicating the time the packet was received. Initially, we see that the processor at pseudonode 1 has the address Y in the ShClean state. A Read(X) request at event time 1 results in an eviction of Y from the E1 cache. The eviction causes a PInvalidate(Y) packet to be sent to the dcache in the processor and an MDrop(Y) eviction notice to be sent to the directory.

At the same time, the MRead(X) request is sent to the directory, which responds with a ReadSharedRsp(X) response. Then, at event time 4, the processor issues a Read(Y) request to the E1 cache. The E1 cache evicts X and sends a PInvalidate(X) to the dcache in the processor and an MDrop(X) eviction notice to the directory. At the same time, the E1 cache sends the MRead(Y) request to the directory, which responds with a ReadSharedRsp(Y). Finally, the MDrop(Y) request from the eviction notice sent at time 1b reaches the directory at event time 8. When it receives the eviction notice, the directory removes E1 from the sharing set and checks to see if there are any remaining sharers. Since there are none, the directory transitions to the Noncached state. At this point, any stores to Y will not be propagated to the E1 cache, resulting in a loss of cache coherence. While it would have been difficult to construct a test to uncover the complex sequence of events that led to this error, our formal verification approach was able to discover it automatically.

5.2 Implementation Verification

The objective of the implementation verification is to run the "witness strings" generated by the Murφ formal verification on the Verilog RTL implementation of the hardware. The Verilog is compiled and simulated using an internally developed tool called Gensim [23]. The witness strings are encoded using a high-level verification environment called Raven [24, 25], which is built around the C/C++ language. The witness strings from the Murφ formal verification are post-processed into stimulus encoded as Raven apply() and verify() calls.

As an example, Figure 8 shows a snippet from a witness string as generated by the Murφ formal verification tool. This string is converted into stimulus for the Verilog simulation model by choosing a "perspective" from which to observe the events. In this case, we will observe them from the perspective of the E chip (L2 cache). Alternatively, we could have chosen to observe them from the memory directory, for instance. Choosing the E chip perspective implies that we will be injecting stimulus into the E chip and verifying that the E chip produces the correct responses. The example in Figure 8 is a very simple sequence of events where a processor is simply making a Read(Y) request. However, this simple request has a total of six memory transactions associated with it. Markers are inserted into the stimulus when the memory system is quiescent, as shown in Figure 9.

Figure 8: An example of a witness string generated by the Murφ formal verification tool.

    --------------------------------------------------
    Issuing scalar request Read from Proc_1 of Node_1
    --------------------------------------------------
    E1(1:ShClean)<---Read(2) [C=0] on vc0 from Proc_1
    Quiescent: 1
    Note: Evicting addr 1 from the Ecache
    E1 sending PInvalidate(1) to Proc_1
    E1 sending MDrop(1) to M1 on vc2
    E1 sending MRead(2) to M2 on vc0
    --------------------------------------------------
    M1(Shared)<---MDrop(1) on vc2 from E1
    Quiescent: 0
    --------------------------------------------------
    M2(Noncached)<---MRead(2) on vc0 from E1
    Quiescent: 0
    --------------------------------------------------
    Memory manager rule fired.
    --------------------------------------------------
    M2(PendMemExclusive)<---MemRExclRsp(2) from MMGR1
    Quiescent: 0
    M2 sending ReadExclResp(2) to E1 on vc1
    --------------------------------------------------
    E1(2:PendingReq)<---RExclResp(2) on vc1 from M2
    Quiescent: 0
    Note: Filling Ecache line...
    E1 sending PReadResp(2) to Proc_1
    --------------------------------------------------

Figure 9: The markers inserted in the stimulus from the witness string shown in Figure 8.

    Quiescent
    E1(X:ShClean)<---Read(Y) on vc0 from P1 [13]
    E1 sending PInvalidate(X) to P1 [14]
    E1 sending MDrop(X) to M1 [15]
    E1 sending MRead(Y) to M2 [16]
    E1(Y:PendingReq)<---RExclResp(Y) on vc1 from M2 [17]
    E1 sending PReadResp(Y) to P1 [18]
    Quiescent

6 Conclusion

We have validated this approach by successfully "replaying" the witness strings from the Murφ formal verification model on the RTL implementation of the E chip. In the process we uncovered eight implementation errors relating to the cache coherence engine in the E chip. Soon, we will be using this approach to verify the cache coherence engine at the memory directory (the M chip).

It is commonplace in computer architecture to use trace-driven simulation to evaluate the performance of an architectural feature. We propose a similar idea for verifying the correctness of a cache coherence protocol based on a formal execution trace generated by executing the formal verification model. This formal trace file can be encoded into a practical format that allows it to be simulated on the actual RTL implementation. This technique provides a rigorous method for bridging the abstraction gap between the architectural verification and the RTL implementation verification. While this work is still ongoing, our initial experience with this proposed method is very promising.
Acknowledgements

This work was supported in part by National Science Foundation grants no. EIA-9971666 and CCR-9900605.

References

[1] Dan Lenoski and W. D. Weber. Scalable Shared-Memory Multiprocessing, pages 143–170, 134–140. Morgan Kaufmann Publishers, 1995.

[2] John Hennessy and David Patterson. Computer Architecture: A Quantitative Approach, pages 655–749. Morgan Kaufmann Publishers, second edition, 1996.

[3] David Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach, pages 273–305 and 589–610. Morgan Kaufmann Publishers, 1998.

[4] David J. Lilja. Cache coherence in large-scale shared-memory multiprocessors: Issues and comparisons. ACM Computing Surveys, 25(3):303–338, September 1993.

[5] S. V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. PhD thesis, Computer Sciences Department, University of Wisconsin-Madison, December 1993.

[6] S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. IEEE Computer, 29(12):66–76, December 1996.

[7] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. L. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proc. 17th Annual Int'l Symp. on Computer Architecture, published as ACM SIGARCH Computer Architecture News, volume 18, number 2, page 15, June 1990.

[8] William W. Collier. Reasoning About Parallel Architectures. Prentice-Hall, 1992.

[9] W. Collier. Multiprocessor diagnostics home page. www.infomall.org/diagnostics/archtest.html.

[10] L. Barroso, S. Iman, J. Jeong, K. Oner, K. Ramamurthy, and M. Dubois. RPM: A rapid prototyping engine for multiprocessor systems. IEEE Computer, 28(2):26–34, February 1995.

[11] M. Dubois, J. Jeong, Y. H. Song, and A. Moga. Rapid hardware prototyping on RPM-2: Methodology and experience.

[12] D. Lenoski, K. Gharachorloo, J. Laudon, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. Design of scalable shared-memory multiprocessors: The DASH approach. In CompCon Spring 1990.

[13] M. Plakal, D. J. Sorin, A. E. Condon, and M. D. Hill. Lamport clocks: Verifying a directory cache-coherence protocol. In Proc. of the 10th ACM Annual Symp. on Parallel Algorithms and Architectures (SPAA'98), June 1998.

[14] A. Condon, M. Hill, M. Plakal, and D. Sorin. Using Lamport clocks to reason about relaxed memory models. In Proc. of the 5th International Symposium on High-Performance Computer Architecture (HPCA-5), January 1999.

[15] James Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA highly scalable server. In Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA-97), volume 25, number 2 of Computer Architecture News, pages 241–251, New York, June 2–4, 1997. ACM Press.

[16] Ásgeir Th. Eiríksson, John Keen, Alex Silbey, Swami Venkataraman, and Michael Woodacre. Origin system design methodology and experience: 1M-gate ASICs and beyond. In COMPCON-97, 1997.

[17] Ásgeir Th. Eiríksson. Integrating formal verification methods with a conventional project design flow. In 33rd Design Automation Conference (DAC'96), pages 666–671, New York, June 1996. Association for Computing Machinery.

[18] F. Pong, M. Browne, G. Aybay, A. Nowatzyk, and M. Dubois. Design verification of the S3.mp cache-coherent shared-memory system. IEEE Transactions on Computers, 47(1):135–140, 1998.

[19] Fong Pong and Michel Dubois. Formal verification of complex coherence protocols using symbolic state models. Journal of the ACM, 45(4):557–587, July 1998.

[20] K. L. McMillan. Symbolic Model Checking, pages 61–85. Kluwer Academic Publishers, 1993.

[21] D. L. Dill, A. J. Drexler, A. J. Hu, and C. H. Yang. Protocol verification as a hardware design aid. In International Conference on Computer Design: VLSI in Computers and Processors, pages 522–525, Los Alamitos, CA, USA, October 1992. IEEE Computer Society Press.

[22] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 28(9):690–691, 1979.

[23] T. Court. Gensim user manual. Technical report, Cray Inc., 1996.

[24] D. Abts and M. Roberts. Verifying large-scale multiprocessors using an abstract verification environment. In Proceedings of the 36th Design Automation Conference (DAC-99), pages 163–168, June 1999.

[25] D. Abts, M. Roberts, and D. Lilja. A balanced approach to high-level verification: Performance trade-offs in verifying large-scale multiprocessors. To appear in the Proceedings of the 2000 International Conference on Parallel Processing (ICPP-2000), August 2000.