

Milo Tomašević
Department of Computer Engineering
Pupin Institute
POB 15, 11000 Belgrade, Yugoslavia
email: etomasev@yubgefi1.bitnet

Veljko Milutinović
School of Electrical Engineering
University of Belgrade
POB 816, 11000 Belgrade, Yugoslavia
email: milutino@pegasus.ch

An appropriate solution to the well-known cache coherence problem in shared memory multiprocessors is one of the key issues in improving the performance and scalability of these systems. Hardware methods are highly convenient because of their transparency for software. They also offer good performance, since they deal with the problem fully dynamically. A great variety of schemes has been proposed; not many of them were implemented. Two major groups of hardware protocols can be recognized: directory and snoopy. This survey underlines their principles and summarizes a relatively large number of relevant representatives from both groups. The coherence problem in multilevel caches is also briefly considered. Special attention is devoted to cache coherence maintenance in large scalable shared memory multiprocessors.

1. Introduction

Multiprocessors are the most appropriate type of computer systems to meet the ever-increasing needs for more computing power. Among them, shared memory multiprocessors represent an especially popular and efficient class. Their increasing use is due to some significant advantages that they offer. The most important advantage is the simplest and most general programming model, which allows easier development of parallel software, and supports efficient sharing of code and data. However, nothing comes for free. Shared memory multiprocessors suffer from potential problems in achieving high performance, because of the inherent contention in accessing shared resources, which is responsible for longer latencies in accessing shared memory. The introduction of private caches attached to the processors helps greatly in reducing average latencies (Figure 1). Caches in multiprocessors are even more useful than in uniprocessors, since they also increase effective memory and communication bandwidth. However, this solution imposes another serious problem. Multiple copies of the same data block can exist in different caches, and if processors are allowed to freely update their own copies, an inconsistent view of the memory is imminent, leading to program malfunction. This is the essence of the well-known cache coherence problem [Dubo88]. The reasons for coherence violation are: sharing of writable data, process migration, and I/O activities. A system of caches is said to be coherent if every read by any processor finds a value produced by the last previous write, no matter which processor performed it. Because of that, the system must incorporate some cache coherence maintenance mechanism which consists of a complete and consistent set of operations (for accessing shared memory), preserves a coherent view of the memory, and ensures program execution with correct data.

The cache coherence problem has attracted considerable attention through the last decade. A lot of research at prominent universities and companies has been devoted to that problem, resulting in a number of proposed solutions. The importance of the problem is emphasized by the fact that not only is a cache coherence solution necessary for correct program execution, but it can also have a very significant impact on system performance, so it is of utmost importance to employ cache coherence solutions as efficiently as possible. It is firmly proved that the efficiency of a cache coherence solution depends to a great extent on system architecture parameters, and especially on parallel program characteristics [Egge89]. The choice of an appropriate solution is even more critical in large multiprocessors, where inefficiencies in coherence maintenance are multiplied, which can seriously hurt scalability.
This research was partially sponsored by the NCR Corporation, Augsburg, Germany, and the NSF of Serbia, Belgrade, Yugoslavia. The EnDOT simulation tools were donated by the ZYCAD Corporation, Menlo Park, California, USA, and the TDT Corporation, Cleveland Heights, Ohio, USA. The MOSIS compatible VLSI design tools were provided by the Tanner Corporation, Pasadena, California, USA. The PLD design tools were provided by the ALTERA Corporation, Santa Clara, California, USA.

M - memory module
C - private cache
P - processor
interconnection network



Proceedings of the Hawaii International Conference on System Sciences, Hawaii, USA, January 5-8, 1993.

Figure 1: A shared memory multiprocessor system with private caches


0-1060-3425/93 $03.00 © 1993 IEEE

2. Cache coherence solutions

Basically, all solutions to the problem can be classified into two large groups: software-based and hardware-based [Sten90]. This is a traditional classification that still holds, since the proposed solutions (so far) follow fully or predominantly one of the two approaches. Meanwhile, solutions using a combination of hardware and software means are becoming more frequent and promising, so it seems that this classification will be of less use in the future.
2.1. Software solutions

Software-based solutions generally rely on the actions of the programmer, compiler, or operating system in dealing with the coherence problem. The simplest but most restrictive method is to declare pages of shared data noncacheable. More advanced methods allow the caching of shared data, accessing them only in critical sections, in a mutually exclusive way. Special cache management instructions are often used for cache bypass, flush, and indiscriminate or selective invalidation, in order to maintain coherence. Decisions about coherence-related actions are often made statically during compiler analysis (which tries to detect conditions for coherence violation). There are also some dynamic methods based on operating system actions. Software schemes are generally less expensive than their hardware counterparts, although they may require considerable hardware support. It is also claimed that they are more convenient for large, scalable multiprocessors. On the other side, some disadvantages are evident, especially in static schemes, where inevitable inefficiencies are incurred, since the compiler analysis is unable to predict the flow of program execution accurately and conservative assumptions have to be made. Software-based solutions will not be elaborated upon any further in this paper.

2.2. Hardware solutions

We have focused our attention on the hardware-based solutions, usually called cache coherence protocols. Although they require an increased hardware complexity, their cost is well justified by significant advantages of the hardware-based approach:

- Hardware schemes deal with the coherence problem by dynamic recognition of inconsistency conditions for shared data, entirely at runtime. They promise better performance, especially for higher levels of data sharing, since the coherence overhead is generated only when actual sharing of data takes place.
- Being totally transparent to software, hardware protocols free the programmer and compiler from any responsibility for coherence maintenance, and impose no restrictions on any layer of software.
- Various proposed hardware schemes efficiently support the full range from small- to large-scale multiprocessors.
- Technology advances have made their cost quite acceptable, compared to overall system costs.

Due to the aforementioned reasons, hardware cache coherence protocols are much more investigated in the literature, and also much more frequently implemented in commercial multiprocessor systems. Essential characteristics of existing hardware schemes are reflected in the following criteria:

- where the status information about data blocks is held and how it is organized,
- who is responsible for preserving coherence in the system,
- what kind of coherence actions are applied,
- which write strategy is accepted,
- which type of notification is used, etc.

According to the most widely accepted classification, based on the first and the second criterion, hardware cache coherence schemes can be principally divided into two large groups: directory and snoopy protocols. In the next two sections, basic principles are presented, and numerous solutions belonging to both groups are surveyed. After that, cache coherence in multilevel hierarchies is discussed. Finally, the employment of the previous methods in large scalable multiprocessors with cache-coherent architectures is considered.

This survey follows one specific organizational structure. Each approach to be considered here is briefly introduced. Then, different examples are presented. Whenever appropriate, the following points of view will be underlined:

- research environment,
- essence of the approach,
- selected details of the approach,
- advantages,
- disadvantages,
- special comments (performance, etc.).

This survey tries to be as broad as possible; however, it is not exhaustive, because of space restrictions.

3. Directory protocols
The main characteristic that distinguishes this group of schemes is that the global system-wide status information relevant for coherence maintenance is stored in some kind of directory. The responsibility for coherence preservation is predominantly delegated to a centralized controller, which is usually a part of the main memory controller. Upon the individual requests of the local cache controllers, the centralized controller checks the directory, and issues the necessary commands for data transfer between memory and caches, or between caches themselves. It is also responsible for keeping the status information up-to-date, so every local action which can affect the global state of a block must be reported to the central controller. Besides the global directory maintained by the central controller, private caches store some local state information about cached blocks. Directory protocols are primarily suitable for multiprocessors with general interconnection networks. The paper [Agar89] introduces one useful classification of directory schemes, denoting them as Dir_i X, where i is the number of pointers, and X is B or NB, for broadcast and no-broadcast schemes, respectively. The global directory can be organized in several ways. According to that issue, directory methods can be divided into three groups: full-map directory, limited directory, and chained directory schemes.
3.1. Full-map directory schemes

The main characteristic of these schemes is that the directory is stored in the main memory, and contains entries for each memory block. An entry points to the exact locations of every cached copy of a memory block, and keeps its status. Using this information, coherence of data in private caches is maintained by directed messages to known locations, avoiding usually expensive broadcasts.

The first protocol from this class was developed in [Tang76] and later implemented in the IBM 3081. It uses an invalidation approach and allows the existence of multiple unmodified cached copies of the same block in the system, but only one modified copy. The memory directory is organized as a set of copies of all individual cache directories. Such an organization implies some serious disadvantages. Directory search is not easy, because duplicates of all cache directories must be checked when determining the status of a particular memory block. Also, the information in the memory directory and the cache directories has to be consistent all the time.

The classical full-map directory scheme proposed in [Cens78] applies the same coherence policy, but provides a much more efficient directory search. The directory contains entries for each memory block in the form of a bit vector (the Dir_N NB scheme). A directory entry consists of N + 1 bits: one presence bit per each of N processor-cache pairs, and one bit which denotes whether the block is modified in one of the caches (Figure 2a). Local cache directories store two bits per cache block (valid and modified bits). A very similar directory scheme is described in [Yen85]. The memory directory is identically organized, but cache entries have an additional bit per block, which is set in the case of a clean exclusive copy. This eliminates the need for directory access on write hits to unmodified private blocks, but implies the burden of maintaining the correct value of this bit.

The main advantage of the full-map approach is that locating the necessary cached copies is easy, and only caches with valid copies are involved in coherence actions for a particular block. Because of that, these schemes deliver the best performance of all directory schemes. There are a couple of disadvantages, too. The centralized controller is inflexible for system expansion by adding new processors. Also, these schemes are not scalable, for several reasons. Since all requests are directed to the central directory, it can become a performance bottleneck. The most serious problem is the significant memory overhead for full-map directory storage in multiprocessor systems with a large number of processors. One approach to alleviate this problem is presented in [OKra90]. The proposed sectored scheme reduces directory size by increasing the block size, while choosing the subblock as the coherency unit.
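The bit-vector organization of [Cens78] can be sketched in a few lines. The following is a minimal illustration of the idea only, not an implementation of any published controller; the class and method names are our own. Each entry holds one presence bit per cache plus a single dirty bit, so invalidations and flush requests go only to the caches whose presence bits are set.

```python
# A minimal sketch of a full-map (Dir_N NB) directory entry: one presence
# bit per cache plus one dirty bit per block, as in the [Cens78] scheme.
# Names and structure are illustrative, not from the paper.

class FullMapDirectory:
    def __init__(self, num_caches, num_blocks):
        self.presence = [[False] * num_caches for _ in range(num_blocks)]
        self.dirty = [False] * num_blocks

    def read_miss(self, block, cache):
        """Return the caches that must write back before the fill."""
        to_flush = []
        if self.dirty[block]:
            # exactly one cache holds the modified copy; it must supply it
            to_flush = [c for c, p in enumerate(self.presence[block]) if p]
            self.dirty[block] = False
        self.presence[block][cache] = True
        return to_flush

    def write_miss(self, block, cache):
        """Invalidate all other copies, then grant an exclusive dirty copy."""
        to_invalidate = [c for c, p in enumerate(self.presence[block])
                         if p and c != cache]
        for c in to_invalidate:
            self.presence[block][c] = False
        self.presence[block][cache] = True
        self.dirty[block] = True
        return to_invalidate

d = FullMapDirectory(num_caches=4, num_blocks=8)
d.read_miss(0, 1); d.read_miss(0, 2)      # caches 1 and 2 share block 0
print(d.write_miss(0, 3))                  # -> [1, 2]: only actual holders
```

Note how the directed messages fall out of the structure: the lists returned by the miss handlers name exactly the caches with valid copies, which is what lets full-map schemes avoid broadcasts.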

3.2. Limited directory schemes

The motivation to cope with the problem of directory overhead in full-map directory schemes led to centralized schemes with partial maps, or limited directories. They replace the presence bit vector with a small number of identifiers pointing to cached copies (Figure 2b). The condition for storage efficiency of limited directory schemes over full-map directory schemes is given by i log2 N < N, where i is the number of pointers, and N is the number of processors. The size difference between a full and a limited directory, for small i and large N, is significant. This concept is justified by the findings of some studies [Webe89] that the number of simultaneously existing cached copies of the same block is usually small.

An evaluation of limited directory schemes is described in [Agar89]. Entries in limited directories contain a fixed number of pointers. Special actions have to be taken when the number of cached copies exceeds the number of pointers. The schemes with broadcast capability (Dir_i B) allow that situation, because they can invalidate all copies using a broadcast signal when necessary. If the protocol disallows broadcasts, one copy has to be invalidated, to free a pointer for the new cached copy (Dir_i NB schemes). These protocols put an upper limit on the number of simultaneously cached copies of a particular block.

One example of an extremely storage-efficient method is the two-bit scheme [Arch85]. Instead of using an array of pointers, only two bits are used to encode the state of each block. Since the exact identities of the caches containing copies of a block are not known, the scheme has to broadcast the coherence command when necessary. The protocol is easily expandable without further modifications; however, performance degradation for higher levels of sharing and a greater number of processors can be intolerable.

M - shared memory
X - data block
P - pointer
V - valid bit
D - dirty bit
CT - chain terminator
N - number of processors
C - cache memory

Figure 2: Examples of three directory organizations: a) full-map directory organization, b) limited directory organization, c) chained directory organization

A solution which combines the partial and full-map approaches is given in [OKra90]. Tags of two different sizes are used. A small tag contains a limited number of pointers, while a large tag consists of the full-map bit vector. Tags are stored in two associative caches and allocated when needed. On the first reference, a small tag is allocated to the block, but when the small tag becomes insufficient, it is freed, and a large tag is allocated. When no tags are free, one of the used tags is selected, the corresponding block is invalidated, and the tag is reallocated. The scalability of limited directory protocols, in terms of memory overhead for directory storage, is quite good; however, their performance heavily depends on the sharing characteristics of parallel applications.
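The storage condition i log2 N < N above is easy to check numerically. The following back-of-the-envelope sketch (our illustration; function names are ours) compares per-entry directory bits for a full map against a Dir_4 limited directory:

```python
# A quick check of the Dir_i storage condition i * log2(N) < N: a limited
# directory with i pointers wins once its pointer array is smaller than the
# N presence bits of a full map. Illustrative only; names are our own.

import math

def entry_bits_full_map(n_processors):
    return n_processors                   # one presence bit per cache

def entry_bits_limited(i_pointers, n_processors):
    # i pointers, each log2(N) bits wide, rounded up to whole bits
    return i_pointers * math.ceil(math.log2(n_processors))

for n in (16, 64, 256, 1024):
    full = entry_bits_full_map(n)
    lim = entry_bits_limited(4, n)        # a Dir_4 example
    print(f"N={n:5d}: full-map {full:5d} bits, Dir_4 {lim:3d} bits per entry")
```

For N = 16 the two are equal (4 x 4 = 16 bits), but at N = 1024 the limited entry needs 40 bits against 1024, which is the "small i, large N" gap the text describes.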


3.3. Chained directory schemes

Another way to ensure the scalability of directory schemes, with respect to tag storage efficiency, is the introduction of chained, distributed directories. It is important that this approach does not limit the number of cached copies. Entries of such a directory are organized in the form of linked lists, where all caches sharing the same block are chained through pointers into one list (Figure 2c). Unlike the two previous directory approaches, chained directories are spread across the individual caches. The entry in main memory is used only to point to the head of the list and to keep the block status. Requests for a block are issued to the memory, and subsequent commands from the memory controller are usually forwarded through the list, using the pointers.

Distributed, chained directories can be organized in the form of either singly or doubly linked lists. Distributed directories implemented as singly linked lists are used in the Stanford Distributed-Directory Protocol [Thap90]. Main memory contains the pointer to the head of the list for a particular block, and the last member of the list contains the chain terminator. On a read miss, the requester is put at the head of the list and obtains the data from the previous head. On a write request, an invalidation signal is forwarded from the head through the intermediate members to the tail. Replacement of a chained copy in a cache is handled by invalidation of the lower part of the list. Completion of actions is acknowledged with reply messages.

A distributed directory with doubly linked lists is proposed in the SCI (Scalable Coherent Interface) project [Jame90]. Each list entry has a forward and a backward pointer. In this way, the replacement problem is alleviated, because an entry can easily be dropped by chaining its predecessor and its successor. An entry is added to the head of the list by updating the pointers of both the new entry and the old head entry. Some optimizations are incorporated into the coherence mechanism. One of the most important is the possibility of combining the requests for list insertions. The coherence protocol can also be bypassed for private data.

The main advantage of chained directory schemes is their scalability, while performance is almost as good as in full-map directory schemes [Chai90]. Because of better handling of the replacement situation, doubly linked lists perform slightly better compared to singly linked lists, at the expense of being more complex and using twice as much storage for pointers.

4. Snoopy protocols

In this group of hardware solutions, the centralized controller and the global state information are not employed. Coherence maintenance is based only on the actions of local cache controllers and distributed local state information. Because of that, all actions on a currently shared block must be announced to all other caches, via a broadcast capability. Local cache controllers are able to snoop on the network, and to recognize the actions and conditions for coherence violation, which imposes some reactions (according to the utilized protocol), in order to preserve coherence. Snoopy protocols are ideally suited for multiprocessors which use a shared bus as the global interconnect, since the shared bus provides very inexpensive and fast broadcasts. They are also known as very cost-effective and flexible schemes. However, coherence actions on the shared bus additionally increase the bus traffic, and make bus saturation more acute. Consequently, only systems with a small to medium number of processors can be supported by snoopy protocols. Two write policies are usually applied in snoopy protocols: write-invalidate and write-update (or write-broadcast).
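Before turning to the individual write policies, the doubly linked sharing lists of the chained directory schemes (Section 3.3) deserve a small sketch. The code below is our illustration of the list mechanics only, not the SCI specification: memory keeps just a head pointer, each cache entry keeps forward and backward pointers, and a replaced entry unlinks itself by re-chaining its neighbours.

```python
# A sketch of a doubly linked sharing list as used by chained directories.
# Our simplification of the SCI idea, not the standard's state machine:
# memory holds only the head pointer; fwd/bwd model per-cache pointers.

class SharingList:
    def __init__(self):
        self.head = None                  # memory-side head pointer
        self.fwd = {}                     # cache id -> next cache in list
        self.bwd = {}                     # cache id -> previous cache

    def insert_head(self, cache):
        """A read miss prepends the requester to the list."""
        old = self.head
        self.fwd[cache] = old
        self.bwd[cache] = None
        if old is not None:
            self.bwd[old] = cache         # old head learns its new predecessor
        self.head = cache

    def unlink(self, cache):
        """Replacement: drop an entry by re-chaining its neighbours."""
        prev, nxt = self.bwd.pop(cache), self.fwd.pop(cache)
        if prev is None:
            self.head = nxt
        else:
            self.fwd[prev] = nxt
        if nxt is not None:
            self.bwd[nxt] = prev

    def members(self):
        out, c = [], self.head
        while c is not None:
            out.append(c)
            c = self.fwd[c]
        return out

s = SharingList()
for c in (0, 1, 2):
    s.insert_head(c)                      # list is now 2 -> 1 -> 0
s.unlink(1)                               # a middle entry drops out cheaply
print(s.members())                        # -> [2, 0]
```

With a singly linked list, `unlink` would have no backward pointer and would have to invalidate the tail of the list instead, which is exactly the replacement cost difference the text describes.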
4.1. Write-invalidate snoopy protocols

Write-invalidate protocols allow multiple readers, but only one writer at a time. Every write to a shared block must be preceded by the invalidation of all other copies of the same block, preventing the use of stale copies (Figure 3a). Once the block is made exclusive, cheap local writes can proceed until some other processor requires the same block.

The approach originated from the WTI protocol, where a write-through to memory on the system bus is followed by invalidation of all copies of the involved block. This simple mechanism can be found in some earlier machines. Such an unsophisticated coherence solution results in poor system performance, because of a very high bus utilization [Arch86]. Only systems with a very small number of processors can use this scheme. Another very simple scheme, the SCCDC protocol, prohibits caching of more than one copy of the block [Wend89]. Every read and write miss takes over the copy and makes it private. This mechanism is cost-effective for applications with prevalent sharing in the form of migratory data and synchronization variables.

The Write-Once protocol was one of the first write-back snoopy schemes. It was intended for single-board computers in the Multibus environment [Good83]. The action on the first write to a shared block is the same as in the WTI protocol, leaving the block in the special reserved state. Subsequent write hits to that copy (in the dirty state) proceed locally, until some other processor requests the block. Another improvement refers to the ability of read misses to be serviced by the cache which holds the dirty block.

An ownership-based protocol was developed for the Synapse N+1, a fault-tolerant multiprocessor system [Fran84]. A bit for each memory block determines whether memory or a cache is the owner of the block, and prevents race conditions. Cache ownership ensures cheap local accesses. When a cache is the block owner, a read miss is inefficiently handled, because memory has to be updated first, and the read request resubmitted then. Another disadvantage is that a write hit to shared


data produces the same action as a write miss, since there is no invalidation signal.

The Berkeley protocol, implemented in the SPUR multiprocessor, also applies the ownership principle, but improves significantly over the previously described protocol [Katz85]. Efficient cache-to-cache transfers without memory update are enabled by the existence of the new shared dirty state. The main disadvantage of the protocol is its inability to recognize a possible exclusive use when fetching a missed block. A more sophisticated version has also been proposed, which uses software hints to distinguish between loads of shared and non-shared data.

The problem of recognizing the sharing status on block load is solved in the Illinois protocol [Papa84]. Private data are better handled, since the protocol entirely avoids invalidations on write hits to unmodified non-shared blocks. To this end, the new exclusive unmodified state is introduced. On a read miss, every cache with a valid copy tries to respond and to supply the data. This requires correct arbitration on the shared bus. If the block is modified, only one cache responds. Simultaneously with the cache-to-cache transfer, a memory update takes place, which can slow down the action.

The CMU RB protocol presented in [Rudo84] tries to remove the main obstacle to performance improvement in write-invalidate protocols, invalidation misses, by introducing block validation. When some cache issues a bus read, all caches with invalidated blocks catch the data available on the bus, and update their own copies. This read broadcast cuts down the number of invalidation misses to one per invalidated block. Only three states are used to specify the block status. Blocks of one-word length are assumed, so spatial locality in programs cannot be exploited in a proper way. In spite of the general usefulness of the read broadcast action, there is a negative side effect, which is the increase of cache interference reflected in high processor lockout from the bus [Toma93a].

Many of the previously mentioned useful features are included in the EIP (Efficient Invalidation Protocol) [Arch87]. Besides dirty owners, this protocol proposes another new feature: clean cache ownership. When no dirty block owner exists, a clean block owner is defined, as the last cache that experienced a read miss for the block, unless the block is replaced later. It makes the handling of read and write misses to most unmodified blocks more efficient compared to protocols in

Figure 3: Write strategy in snoopy protocols: a) invalidation policy, b) update policy

M - shared memory
C - cache memory
X - data block
P - updated data block
V - valid bit
- - distributed write
- - - invalidation signal


which only memory can be the clean owner. The EIP protocol also applies the concept of block validation, as in the RB protocol. Implementation of the EIP requires an increased hardware complexity.

Instead of the usual full-block invalidation, the WIP (Word Invalidation Protocol) employs a partial, word invalidation [Tama92]. It allows the existence of invalid words within a partially valid block, until the pollution point is reached. After the invalidation threshold is reached, the whole block is invalidated. This fine-grain invalidation approach avoids some unnecessary invalidations and achieves better data utilization. A read miss to a partially valid block is serviced by a shorter bus read-word operation, which gradually recovers the block. A write miss on the only invalid word in a partially valid block is also optimized.

Performance studies show that write-invalidate protocols are a good choice for applications characterized by a sequential pattern of sharing, while fine-grain sharing can hurt their performance a lot [Egge89].
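The common core of the protocols in this section can be captured by a minimal three-state (Invalid/Shared/Modified) state machine. The sketch below is our generic illustration, not any specific published protocol; the class names, bus operations, and states are ours. Every cache snoops every bus transaction and demotes or invalidates its copy accordingly.

```python
# A minimal three-state write-invalidate snoopy cache: a generic sketch of
# the idea behind the protocols above, not any particular published scheme.
# Each cache observes all bus events and reacts to those of other caches.

INVALID, SHARED, MODIFIED = "I", "S", "M"

class Bus:
    def __init__(self):
        self.caches = []
    def attach(self, cache):
        self.caches.append(cache)
    def broadcast(self, src, op, block):
        for c in self.caches:
            if c is not src:
                c.snoop(op, block)

class SnoopyCache:
    def __init__(self, bus):
        self.state = {}                   # block -> state (absent = invalid)
        self.bus = bus
        bus.attach(self)

    def read(self, block):
        if self.state.get(block, INVALID) == INVALID:
            self.bus.broadcast(self, "BusRd", block)   # read miss
            self.state[block] = SHARED

    def write(self, block):
        if self.state.get(block, INVALID) != MODIFIED:
            self.bus.broadcast(self, "BusRdX", block)  # invalidate others
            self.state[block] = MODIFIED               # then write locally

    def snoop(self, op, block):
        if op == "BusRdX":
            self.state[block] = INVALID                # another cache writes
        elif op == "BusRd" and self.state.get(block) == MODIFIED:
            self.state[block] = SHARED                 # supply data, demote

bus = Bus()
c0, c1 = SnoopyCache(bus), SnoopyCache(bus)
c0.read(0); c1.read(0)                    # both caches hold a shared copy
c1.write(0)                               # c1's write invalidates c0's copy
print(c0.state[0], c1.state[0])           # -> I M
```

Once c1 holds the block in Modified state, its subsequent writes generate no bus traffic at all, which is the "cheap local writes" property the section opens with.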
4.2. Write-update protocols

Write-update schemes follow a distributed-write approach, which allows the existence of multiple copies with write permission. The word to be written to a shared block is broadcast to all caches, and caches containing that block can update it (Figure 3b). Write-update protocols usually employ a special bus line for dynamic detection of the sharing status of a cache block. This line is activated on the occasion of a distributed write, whenever more than one cached copy of the block exists. When sharing of a block ceases, the block is marked as private and the distributed writes are no longer necessary. In this way, write-through is used for shared data, and write-back for private data.

A typical protocol from this group is the Firefly [Thac88], which is implemented in DEC's Firefly multiprocessor workstation. Only three states are sufficient to describe the block status, since the invalid state is not required. The memory copy is also updated on a word broadcast. No clean owner exists, and all caches with shared copies respond to a read miss, supplying the same data in a synchronized manner.

A very similar coherence maintenance is applied in the Dragon protocol [McCr84], which was developed for the Dragon multiprocessor workstation from Xerox PARC. The main difference from the previous protocol is in the memory update policy. Dragon does not update memory on a distributed write, resulting in the need for a clean owner (in respect to other cached copies), described by an additional state. The owner is the last cache that performed a write to a block. The only situation which requires a memory update is the eviction of an owned block from the cache. Avoiding frequent memory updates on shared write hits is the source of the performance improvement over the Firefly protocol, especially for higher rates of shared references [Arch86].

Serious performance degradation of write-update protocols can be caused by process migration.
Allowing processes to be switched among processors produces false sharing, when logically private data become physically shared. In that situation, actual sharing increases and unnecessary coherency overhead is induced. A Dragon-like protocol is proposed in [Pret90] which tries to cope with this problem. It introduces one bit with each block to discriminate between copies referenced by the current and the previous process, called used and unused copies. By comparing this identifier with the running process

identifier, upon each processor operation or bus transaction, unused copies can be detected and eliminated. An additional bus line is also required for transferring the dirty ownership. In this way, some unnecessary write broadcasts can be avoided and the effects of cache interference reduced.

Contrary to write-invalidate protocols, this group is better suited for applications with tighter sharing. In the case of sequential sharing and process migration, performance can be seriously hurt by frequent unnecessary write broadcasts.
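The distributed-write mechanism above can be sketched as follows. This is our illustration of the general write-update idea with a sharing line, not the Firefly or Dragon state machine; the class and names are ours. Writes update every other cached copy, and the memory is updated too whenever the sharing line indicates another copy exists (the Firefly-style choice).

```python
# A sketch of write-update snooping: a write to a block is broadcast as a
# word update to all other holders, and a "shared" bus line tells the writer
# whether any other copy exists. Illustrative only; not a published protocol.

class UpdateCache:
    def __init__(self, all_caches):
        self.data = {}                    # block -> current value
        self.all_caches = all_caches
        all_caches.append(self)

    def read(self, block, memory):
        if block not in self.data:
            self.data[block] = memory.get(block, 0)   # read miss: fill

    def write(self, block, value, memory):
        self.data[block] = value
        shared_line = False
        for c in self.all_caches:         # distributed write to other copies
            if c is not self and block in c.data:
                c.data[block] = value
                shared_line = True
        if shared_line:
            memory[block] = value         # shared: write-through to memory
        # if shared_line stayed low, the block is private; a Dragon-style
        # protocol would defer the memory update until eviction

caches = []
mem = {0: 7}
a, b = UpdateCache(caches), UpdateCache(caches)
a.read(0, mem); b.read(0, mem)            # two copies of block 0
a.write(0, 42, mem)
print(b.data[0], mem[0])                  # -> 42 42
```

The comment at the end marks the exact point where Firefly and Dragon differ: Dragon would skip the memory write and instead remember the last writer as the clean owner.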
4.3. Adaptive protocols

Evidently, neither of the two approaches is able to deliver superior performance across all types of workloads. This is the reason why some protocols have been proposed that combine the invalidation and update policies in a suitable way. They start with write broadcasts, but when a longer sequence of local writes is encountered or predicted, the invalidation signal for the block is sent. These solutions can be regarded as adaptive, since they attempt to adapt the coherence mechanisms to the observed and predicted data use, in an effort to achieve optimal performance.

One of the first examples of a combined policy can be found in the RWB protocol [Rudo84], which is an enhancement of the former RB protocol. The first write to a shared block is a write-through, which updates other cached copies and memory as well. On the second successive write to the block by the same processor, an invalidation signal is sent onto the system bus, making the own copy exclusive. It seems that the invalidation threshold of this protocol is too low. Also, other copies need not be invalidated if they are used only for reading.

The combination of the invalidate and update policies, according to the read-and-write access pattern, is the essence of the competitive snooping approach [Karl86]. Coherence maintenance starts with write-updates until their cost in cycles reaches the total cost of invalidation misses for all processors possessing the block, and then switches to write-invalidates. The coherence overhead stays within a factor of two, compared to the cost of the optimal off-line algorithm. Whether this approach brings a performance improvement or not is highly dependent on the sharing pattern of the particular parallel application. A couple of variants of the approach have been proposed, as well as their implementation details.

A similar principle is followed in the EDWP protocol [Arch88]. A fixed criterion is set for switching from the update policy to the invalidate policy.
After three distributed writes by a single processor, uninterrupted by a reference from any other processor, all remote cached copies will be invalidated. This criterion seems to be a reliable indicator of local block usage. Like the EIP, this protocol also allows a cache to obtain clean ownership of a block, improving the efficiency of miss service. Block status is defined with eight possible states. An additional bus line for determining the existence of dirty owners is also needed.

Since a wide diversity of coherence protocols exists in this area, a strong need for a standardized cache coherence protocol has appeared. The MOESI class of compatible coherence protocols is introduced in [Swea86]. The MOESI model encompasses five possible block states, according to the validity, exclusiveness, and ownership statuses. Most of the existing protocols fit in this class, either as originally defined or with some adaptations. This class is very flexible, because each cache in the system may employ a different coherence


and write policy from the class at the same time. A consistent memory view is still guaranteed. The IEEE Futurebus standard bus provides all necessary signals for the implementation of this model.
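The adaptive switching criterion of EDWP-like protocols can be illustrated with a small counter per block. This is our sketch of the idea only, not the published eight-state machine; the class, names, and bookkeeping are ours. The block stays in update mode until one processor performs three successive writes with no intervening reference by anyone else, at which point the remote copies are invalidated.

```python
# A sketch of the adaptive update-to-invalidate switch behind EDWP-like
# protocols: count successive writes by one processor and invalidate remote
# copies after a fixed threshold (three, as in [Arch88]). Illustrative only.

THRESHOLD = 3

class AdaptiveBlock:
    def __init__(self):
        self.copies = set()               # caches currently holding the block
        self.last_writer = None
        self.run = 0                      # successive writes by last_writer

    def read(self, cache):
        self.copies.add(cache)
        if cache != self.last_writer:     # a remote reference breaks the run
            self.last_writer, self.run = None, 0

    def write(self, cache):
        self.copies.add(cache)
        if cache == self.last_writer:
            self.run += 1
        else:
            self.last_writer, self.run = cache, 1
        if self.run >= THRESHOLD and len(self.copies) > 1:
            self.copies = {cache}         # switch: invalidate remote copies
            return "invalidate"
        return "update"                   # stay in distributed-write mode

blk = AdaptiveBlock()
blk.read(0); blk.read(1)                  # two sharers
print(blk.write(1), blk.write(1), blk.write(1))
# -> update update invalidate
```

After the switch, the writer holds the only copy, so further local writes proceed with no bus traffic, which is exactly the long-write-run case the adaptive schemes are designed to detect.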
4.4. Lock-bPsed protocols Synchronization and mutual exclusion in accessing shared data is the topic very closely related to c o w . Most of the snoopy coherence schemes enforce strict consistency model and do not address these issues explicitly. The common way to achieve exclusive access during write to shared data is using the system bus as semaphore. Meanwhile, some a p proaches h d it useful to d i r e c t l y support synchronization primitives along with the mechanism for Preserving cache coherence. We will mention here two ofthem. The protoool proposed in pita861 is a typical example of a lock-based protocol. It supports efficient busy-wait locking, waiting and unlocking, without use of the test-and-set operation. Two additional states dedicated to lock h d h g are introduced:one for lock possession and the other for lock waiting. This mechanism allows the lock-waiter to work while n l o c k is broadcast on system bus, so the snoop waiting. U busy-wait register of waiter can it upon match. Then, it obtains the lock after priontmd bus arbitration, and interrupts the processor to use the locked data. In t b s way, unsuccessll retries for lock acquiring are completely removed hthe bus which is a critical system resource. The main disadvantage of t h i s scheme is that it does not differentiate between read lock and write lock requests. Lock primitives are also supported in the snoopy protocol described in [Lee90], which combines coherence strategy with synchronization. Waiting for a lock is orgamzd through distributed, hardware implemented FIFO queues. Caches which try to obtain lock are chained in a waiting list. The tag field of each cache entry, besicks state, contains a pointer to the next waitmg cache and the ownt of waiters in the peer pup. The advantage over previous protocols is that shared and exclusive locks can be dismgwhed. s h a r e d locks allow simultaneous access for read lock requesters, which improves performance. 
This protocol has 13 possible block states, and incurs significant additional hardware complexity.
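As an illustration, the distributed FIFO waiting list of [Lee90] can be sketched in software as a chain of cache entries, each holding a pointer to the next waiter, so that lock hand-off follows the chain without any retries on the bus. This is only a minimal behavioral model; all class and method names are illustrative and not taken from the original protocol.

```python
# Sketch of a hardware FIFO lock queue in the spirit of [Lee90]:
# waiting caches are chained through pointers kept in the cache tags,
# and a release hands the lock directly to the next waiter.

class CacheEntry:
    def __init__(self, cache_id):
        self.cache_id = cache_id
        self.next_waiter = None   # pointer field kept in the cache tag

class LockLine:
    """One lockable memory block; holder/tail model the hardware chain."""
    def __init__(self):
        self.holder = None
        self.tail = None

    def acquire(self, entry):
        if self.holder is None:          # lock free: take it immediately
            self.holder = entry
            self.tail = entry
            return True                  # acquired
        self.tail.next_waiter = entry    # append to the FIFO chain
        self.tail = entry
        return False                     # must wait; no bus retries issued

    def release(self):
        nxt = self.holder.next_waiter
        self.holder.next_waiter = None
        self.holder = nxt                # hand the lock to the next waiter
        if nxt is None:
            self.tail = None
        return nxt
```

With three caches A, B, C competing, A acquires immediately, B and C enqueue, and each release passes the lock strictly in FIFO order, mirroring how the hardware chain avoids repeated bus arbitration.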

5. Coherence in multilevel caches

Bearing in mind the growing disparity between processor and memory speed, cache memories become more and more important. Meanwhile, single-level caches are unable to successfully fulfill the two usual requirements: to be both fast and large enough. A multilevel cache hierarchy seems to be the unavoidable solution to the problem. Lower levels of such a hierarchy are smaller but faster caches; their task is to reduce miss latency. Upper level caches are slower but much larger, in order to attain a higher hit ratio and to reduce the traffic on the interconnection network. Specific extensions of existing protocols are needed to ensure coherence in multilevel caches.

Among many successful organizations of multilevel caches, three of them appear to be the most important [Baer88]. The first organization is an extension of the single-level cache, where every processor has its private hierarchy of caches (Figure 4a). Besides private first level caches, the other two organizations introduce shared caches on the upper levels, common for several processors, but with a different access structure. In the second scheme, the upper level cache is multiported between the lower level caches (Figure 4b), while in the third organization, the upper level cache is accessed through a shared bus from the lower level caches (Figure 4c).

Since coherence maintenance is aggravated in multilevel caches, it is necessary to follow the principle of inclusion to make it more efficient. This principle implies that the upper level cache is a superset of all caches in the hierarchy below it. In this way, coherence actions towards lower levels are filtered and reduced only to the really necessary ones, lowering the cache interference. This gain justifies some space inefficiency incurred by inclusion. Applying inclusion to these types of cache hierarchies shows that the conditions for simple coherence maintenance can easily be met in the first type of organization, and that the third organization can also be an attractive solution, because of its cost-effectiveness.
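The inclusion principle can be illustrated with a small model of a private two-level hierarchy: whenever the larger upper level cache evicts a block, any copy in the lower level must be back-invalidated so that the superset relation always holds. The sketch below is a simplified LRU model under that assumption; the class and method names are illustrative only.

```python
# Sketch of the inclusion property in a private two-level hierarchy:
# the L2 cache is kept a superset of L1, so an L2 eviction forces a
# back-invalidation of the same block in L1.

class InclusiveHierarchy:
    def __init__(self, l1_size, l2_size):
        self.l1 = []          # LRU order, most recently used last
        self.l2 = []
        self.l1_size = l1_size
        self.l2_size = l2_size

    def access(self, block):
        # fill both levels; inclusion means L1 contents are always in L2
        for cache, size in ((self.l2, self.l2_size), (self.l1, self.l1_size)):
            if block in cache:
                cache.remove(block)
            cache.append(block)
            if len(cache) > size:
                victim = cache.pop(0)
                if cache is self.l2 and victim in self.l1:
                    # inclusion forces back-invalidation of the L1 copy
                    self.l1.remove(victim)

    def inclusion_holds(self):
        return all(b in self.l2 for b in self.l1)
```

The filtering benefit described above comes from the same invariant: since every L1 block is present in L2, external coherence requests that miss in L2 never need to disturb L1.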
Figure 4: Multilevel cache organizations: a) private, b) multiport shared, c) bus-based shared
(M - memory module, SB - system bus, C1 - first level cache, C2 - second level cache, CB - cluster bus, P - processor)

A specific two-level cache hierarchy is proposed in [Wang89]. Every processor has a virtually addressed first level cache and a physically addressed second level cache, with write-back on both levels. Virtual addressing on the first level makes it faster, because address translation is avoided in the case of a hit, although new problems like the handling of synonyms are introduced. This problem is solved by maintaining pointers between copies in the first and the second level caches, as a part of an extended invalidation protocol for preserving coherence. The inclusion property is also applied. In order to increase memory bandwidth, splitting the first level cache into an instruction part and a data part is advocated.
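The synonym mechanism of [Wang89] can be sketched as follows: the physically addressed second level keeps a back-pointer to the single virtual-address copy allowed in the first level, so a hit under a second virtual name invalidates the old first-level copy before installing the new one. This is a minimal behavioral sketch, not the actual protocol; all names are illustrative.

```python
# Sketch of synonym handling in a virtual L1 / physical L2 hierarchy:
# the L2 entry records which virtual address currently holds the block
# in L1, so two synonyms never coexist in the first level cache.

class TwoLevelVRCache:
    def __init__(self, translate):
        self.translate = translate   # virtual -> physical mapping
        self.l1 = {}                 # virtual address -> physical block
        self.l2 = {}                 # physical address -> L1 back-pointer

    def access(self, vaddr):
        if vaddr in self.l1:
            return "l1-hit"          # fast path, no translation needed
        paddr = self.translate(vaddr)
        if paddr in self.l2:
            old_vaddr = self.l2[paddr]
            if old_vaddr is not None and old_vaddr in self.l1:
                del self.l1[old_vaddr]   # invalidate the synonym copy
            self.l2[paddr] = vaddr       # update the back-pointer
            self.l1[vaddr] = paddr
            return "l2-hit"
        self.l2[paddr] = vaddr           # miss: allocate in both levels
        self.l1[vaddr] = paddr
        return "miss"
```

For example, with a mapping where virtual addresses 1 and 5 are synonyms for the same physical block, an access to 5 after 1 hits in L2 and removes the stale copy held under virtual address 1.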

6. Cache coherence in large shared memory multiprocessors

One of the most desired characteristics of hardware cache coherence solutions is scalability: the ability to efficiently support large-scale shared memory multiprocessor systems. In the strive for more processing power, a broad range of architectures has been proposed for those systems [Thak90]. Some of them try to retain the advantages of common-bus systems and to overcome scalability limitations by introducing multiple-bus architectures; appropriate snoopy protocol extensions are the key issue for these systems. Other systems are oriented towards general interconnection networks and directory-based schemes. A frequent solution is to organize the system into bus-based clusters or subsystems connected by some kind of network. Consequently, the schemes that combine the principles and advantages of both snoopy and directory protocols on different levels seem to be highly effective for these systems.

The Wisconsin Multicube is a highly symmetric bus-based shared memory multiprocessor which supports up to 1024 processors [Good88]. Processing nodes are placed at the intersections of busses, and memory modules are tied to the column busses. Each node includes dedicated two-level caches. A small first level cache keeps miss latency low, while a very large second level cache is intended to reduce bus traffic. It is connected to both the row and the column bus, and is also responsible for coherence maintenance by snooping on both busses. To acquire blocks, the invalidation protocol issues at most twice as many bus operations on the multiple busses as would be needed in a single-bus system.

The same topology is applied in the Aquarius multiprocessor [Carl90], but it differs from the Wisconsin Multicube in the memory distribution and the cache coherence protocol. Memory is completely distributed across the nodes, which also contain large caches. The coherence solution combines features of snooping and directory schemes: the snooping principle maintains coherence of caches on individual busses, while the directory method is used to preserve coherence among busses. Directories for memory blocks reside in large local memory modules.

A quite different multiple-bus architecture is proposed in [Wils87]. Caches and busses are hierarchically organized in a multilevel tree-like structure. Lower level caches are private and connected to local busses, while caches on higher levels are shared. An appropriate invalidation snooping solution (an extended and modified Write-Once protocol) is employed for coherence maintenance. Following the inclusion property, it can filter the invalidation actions towards lower level caches and restrict coherency actions only to the sections where they are necessary. Significant traffic reduction can be attained in this way. A cluster architecture with memory modules distributed among the clusters is also proposed. This approach is applied in the Encore GigaMax project.

For the same type of organization, a somewhat different coherence solution, which takes into account the type of sharing on the various levels, is proposed in [DeWa90]. Since process allocation which results in tighter sharing on lower levels and looser sharing on higher levels is naturally expected, a snoopy write-update (modified Dragon) protocol is used inside the subsystems, and a write-invalidate (modified Berkeley) protocol is used on the global level. Another protocol proposed for a hierarchical two-level bus topology with clusters can be found in [Arch88]. The protocol incorporates selected distributed features of snoopy protocols and consists of an internal and a global portion.

A hierarchical cluster architecture similar to the previous one also characterizes the Data Diffusion Machine [Hari89]. Very large processor caches implement virtual shared memory. A hierarchical write-invalidate protocol is used for cache coherence maintenance, which fully supports data migration. Clusters are connected to the higher levels through data controllers, where a set-associative directory for the data blocks on the lower level resides. The controller also monitors the activities on the busses in a snooping manner.

Figure 5: Hierarchical organization of a large-scale shared memory multiprocessor
(M - global memory module, ICN - global interconnection network, CC - cluster controller, CB - cluster bus, LM - local memory, C - cache memory, P - processor)

Stanford's DASH distributed shared memory multiprocessor system is composed of common-bus multiprocessor nodes linked by a scalable interconnect of a general type [Leno90]. It is one of the first operational machines with a scalable cache coherence architecture. Each node contains private caches and employs a snoopy scheme for cache coherence. System-wide coherence is maintained using a directory-based protocol, which is independent of the network type. There is no global directory; it is partitioned and distributed across the nodes, as is memory. Among other means for memory access optimization, the DASH supports a more relaxed release consistency model in hardware.

So far, hardware and software solutions have mostly been developed independently of each other, each approach trying to solve the problem only in its own domain. However, we strongly believe that a very promising approach in attaining optimal performance is the complementary use of hardware and software means. We expect that this is a direction where real breakthroughs are likely to happen. MIT's Alewife multiprocessor uses the LimitLESS protocol, which represents a hardware-based cache coherence method supported by a software mechanism [Chai91]. Directory entries implement only a limited number of hardware pointers, in order to be storage efficient, counting on this being sufficient in the vast majority of cases. Exceptional circumstances, when more pointers are needed, are handled in software. In those infrequent cases, an interrupt is generated, and a full-map directory for the block is emulated. A fast trap mechanism provides support for this feature.
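The limited-pointer directory with software overflow handling can be sketched as a small state machine: a fixed number of hardware pointers track sharers, and when one more sharer arrives, a trap converts the entry into a software-emulated full map. The pointer count and names below are illustrative, not those of the actual LimitLESS implementation.

```python
# Sketch of a LimitLESS-style limited directory [Chai91]: few hardware
# sharer pointers; pointer overflow traps to software, which then
# emulates a full-map directory for the block.

HW_POINTERS = 4      # illustrative hardware pointer count

class DirectoryEntry:
    def __init__(self):
        self.pointers = set()     # hardware sharer pointers
        self.sw_sharers = None    # software-emulated full map (if trapped)

    def add_sharer(self, node):
        if self.sw_sharers is not None:
            self.sw_sharers.add(node)      # already in software mode
            return "software"
        if node in self.pointers or len(self.pointers) < HW_POINTERS:
            self.pointers.add(node)        # common case: pure hardware
            return "hardware"
        # pointer overflow: interrupt, emulate full map in software
        self.sw_sharers = set(self.pointers) | {node}
        self.pointers.clear()
        return "trap"

    def sharers(self):
        return self.sw_sharers if self.sw_sharers is not None else self.pointers
```

The design bet is visible in the three return values: almost all blocks stay on the "hardware" path, and the costly "trap" transition is paid only for the rare widely shared blocks.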

7. Conclusion
This survey tries to give a comprehensive overview of hardware-based solutions to the cache coherence problem in shared memory multiprocessors. Doubtlessly, this topic deserves much more time and space, so we decided to provide a list of papers for additional reading, on a number of different subjects related to the general topic of cache coherence in multiprocessor systems. The great importance of the problem and the strong impact of the applied solution on system performance oblige system architects and designers to carefully consider the aforementioned issues in their pursuit towards more powerful and more efficient shared memory multiprocessors. Despite the considerable advancement of the field, it still represents a very active research area. Much work on developing, implementing, and evaluating the solutions remains to be done, with significant prospective benefits.

8. References
[Agar89] Agarwal A., Simoni R., Hennessy J., Horowitz M., An Evaluation of Directory Schemes for Cache Coherence, Proceedings of the 16th ISCA, 1989, pp. 280-289.
[Arch85] Archibald J., Baer J. L., An Economical Solution to the Cache Coherence Problem, Proceedings of the 12th ISCA, 1985, pp. 355-362.
[Arch86] Archibald J., Baer J. L., Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model, ACM Transactions on Computer Systems, Vol. 4, No. 4, November 1986, pp. 273-298.
[Arch87] Archibald J., The Cache Coherence Problem in Shared-Memory Multiprocessors, PhD Thesis, University of Washington, February 1987.
[Arch88] Archibald J., A Cache Coherence Approach For Large Multiprocessor Systems, Proceedings of the Supercomputing Conference, 1988, pp. 337-345.
[Baer88] Baer J. L., Wang W. H., On the Inclusion Properties for Multi-level Cache Hierarchies, Proceedings of the 15th ISCA, 1988, pp. 414-423.
[Bita86] Bitar P., Despain A., Multiprocessor Cache Synchronization: Issues, Innovations, Evolution, Proceedings of the 13th ISCA, June 1986, pp. 424-442.
[Carl90] Carlton M., Despain A., Aquarius Project, IEEE Computer, Vol. 23, No. 6, June 1990, pp. 80-83.
[Cens78] Censier L., Feautrier P., A New Solution to Coherence Problems in Multicache Systems, IEEE Transactions on Computers, Vol. C-27, No. 12, December 1978, pp. 1112-1118.
[Chai90] Chaiken D., Fields C., Kurihara K., Agarwal A., Directory-Based Cache Coherence in Large-Scale Multiprocessors, IEEE Computer, Vol. 23, No. 6, June 1990, pp. 49-58.
[Chai91] Chaiken D., Kubiatowicz J., Agarwal A., LimitLESS Directories: A Scalable Cache Coherence Scheme, Proceedings of the 18th ISCA, 1991, pp. 224-234.
[Dubo88] Dubois M., Scheurich C., Briggs F., Synchronization, Coherence, and Event Ordering in Multiprocessors, IEEE Computer, Vol. 21, No. 2, February 1988, pp. 9-21.

[Egge89] Eggers S. J., Simulation Analysis of Data Sharing in Shared Memory Multiprocessors, Report No. UCB/CSD 89/501, University of California, Berkeley, April 1989.
[Egge89a] Eggers S., Katz R., Evaluating the Performance of Four Snooping Cache Coherency Protocols, Proceedings of the 16th ISCA, 1989, pp. 2-14.
[Fran84] Frank S. J., Tightly Coupled Multiprocessor System Speeds Memory-access Times, Electronics, 57, 1, January 1984, pp. 164-169.
[Good83] Goodman J., Using Cache Memory To Reduce Processor-Memory Traffic, Proceedings of the 10th ISCA, 1983, pp. 124-131.
[Good88] Goodman J., Woest P., The Wisconsin Multicube: A New Large-Scale Cache-Coherent Multiprocessor, Proceedings of the 15th ISCA, May 1988, pp. 422-431.
[Hari89] Haridi S., Hagersten E., The Cache Coherence Protocol of the Data Diffusion Machine, Proceedings of the PARLE 89, Vol. 1, 1989.
[Jame90] James D. V., Laundrie A. T., Gjessing S., Sohi G. S., Scalable Coherent Interface, IEEE Computer, Vol. 23, No. 6, June 1990, pp. 74-77.
[Karl86] Karlin A., Manasse M., Rudolph L., Sleator D., Competitive Snoopy Caching, Proceedings of the 27th Annual Symposium on Foundations of Computer Science, 1986, pp. 244-254.
[Katz85] Katz R., Eggers S., Wood D., Perkins C., Sheldon R., Implementing a Cache Consistency Protocol, Proceedings of the 12th ISCA, 1985, pp. 276-283.
[Lee90] Lee J., Ramachandran U., Synchronization with Multiprocessor Caches, Proceedings of the 17th ISCA, 1990, pp. 27-37.
[Leno90] Lenoski D., Laudon J., Gharachorloo K., Gupta A., Hennessy J., The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor, Proceedings of the 17th ISCA, 1990, pp. 148-159.
[McCr84] McCreight E. M., The Dragon Computer System, an Early Overview, Proceedings of the NATO Advanced Study Institute on Microarchitecture of VLSI Computers, Urbino, Italy, July 1984.
[Mend89] Mendelson A., Pradhan D. K., Singh A. D., A Single Copy Data Coherence Scheme for Multiprocessor Systems, Computer Architecture News, 1989, pp. 36-49.
[OKra90] O'Krafka B. W., Newton A. R., An Empirical Evaluation of Two Memory-Efficient Directory Methods, Proceedings of the 17th ISCA, 1990, pp. 138-147.
[Papa84] Papamarcos M., Patel J., A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories, Proceedings of the 11th ISCA, 1984, pp. 348-354.
[Pret90] Prete C. A., A New Solution of Coherence Protocol for Tightly Coupled Multiprocessor Systems, Microprocessing and Microprogramming, 30, 1990, pp. 207-214.
[Rudo84] Rudolph L., Segall Z., Dynamic Decentralized Cache Schemes for MIMD Parallel Processors, Proceedings of the 11th ISCA, 1984, pp. 348-354.
[Sten90] Stenstrom P., A Survey of Cache Coherence Schemes for Multiprocessors, IEEE Computer, Vol. 23, No. 6, June 1990, pp. 12-24.
[Swea86] Sweazey P., Smith A. J., A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus, Proceedings of the 13th ISCA, 1986, pp. 414-423.
[Tang76] Tang C., Cache System Design in the Tightly Coupled Multiprocessor System, Proceedings of the National Computer Conference, 1976, pp. 749-753.
[Thac88] Thacker C., Stewart L., Satterthwaite E., Firefly: A Multiprocessor Workstation, IEEE Micro, February 1988.



[Thak90] Thakkar S., Dubois M., Laundrie A., Sohi G., Scalable Shared-Memory Multiprocessor Architectures, IEEE Computer, Vol. 23, No. 6, June 1990, pp. 78-80.
[Toma92] Tomasevic M., Milutinovic V., A Simulation Study of Snoopy Cache Coherence Protocols, Proceedings of the 25th HICSS, January 1992, pp. 427-436.
[Wang89] Wang W. H., Baer J. L., Levy H., Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy, Proceedings of the 16th ISCA, 1989, pp. 140-148.
[Webe89] Weber W. D., Gupta A., Analysis of Cache Invalidation Patterns in Multiprocessors, Proceedings of the 3rd ASPLOS, April 1989, pp. 243-256.
[Wils87] Wilson A., Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors, Proceedings of the 14th ISCA, 1987, pp. 244-252.
[Yen85] Yen W. C., Yen D. W. L., Fu K. S., Data Coherence Problem in a Multicache System, IEEE Transactions on Computers, C-34(1), January 1985, pp. 56-65.

9. Suggestions for further reading

Here we provide an additional list of references (not mentioned in the paper because of space restrictions). It gives a deeper insight into the cache coherence problem in shared memory multiprocessors. The papers cover a broad range of topics in the general area of cache coherence: parallel program characteristics, analytic and simulation models of protocols, performance evaluation and comparison, etc.

[Adve91] Adve S., Adve V., Hill M., Vernon M., Comparison of Hardware and Software Cache Coherence Schemes, Proceedings of the 18th ISCA, 1991, pp. 298-308.
[Agar89a] Agarwal A., Gupta A., Spatial Locality in Multiprocessor Memory References, Technical Report, Stanford University, June 1989.
[Baer89] Baer J. L., Wang W. H., Multilevel Cache Hierarchies: Organizations, Protocols and Performance, Journal of Parallel and Distributed Computing, 6, 1989, pp. 451-476.
[Cheo89] Cheong H., Veidenbaum A., A Version Control Approach to Cache Coherence, Proceedings of the International Conference on Supercomputing 89, June 1989, pp. 322-330.
[Cher91] Cheriton D., Goosen H., Boyle P., Paradigm: A Highly Scalable Shared-Memory Multicomputer Architecture, IEEE Computer, Vol. 24, No. 2, February 1991, pp. 33-46.
[Cytr88] Cytron R., Karlovsky S., McAuliffe K., Automatic Management of Programmable Caches, Proceedings of the 1988 ICPP, August 1988, pp. 229-238.
[Dubo82] Dubois M., Briggs F., Effects of Cache Coherency in Multiprocessors, IEEE Transactions on Computers, Vol. C-31, No. 11, November 1982, pp. 1083-1099.

[Egge88] Eggers S., Katz R., A Characterization of Sharing in Parallel Programs and its Application to Coherency Protocol Evaluation, Proceedings of the 15th ISCA, May 1988, pp. 373-383.
[Egge89b] Eggers S., Katz R., The Effect of Sharing on the Cache and Bus Performance of Parallel Programs, Proceedings of the 3rd ASPLOS, April 1989, pp. 257-270.
[Good87] Goodman J., Coherency for Multiprocessor Virtual Address Caches, Proceedings of the 2nd ASPLOS, 1987, pp. 72-81.
[Min89] Min S. L., Baer J. L., A Timestamp-Based Cache Coherence Scheme, Proceedings of the 1989 ICPP, pp. 123-132.
[Owic89] Owicki S., Agarwal A., Evaluating the Performance of Software Cache Coherence, Proceedings of the 3rd ASPLOS, April 1989, pp. 230-242.
[Sche87] Scheurich C., Dubois M., Correct Memory Operation of Cache-Based Multiprocessors, Proceedings of the 14th ISCA, June 1987.
[Site88] Sites R., Agarwal A., Multiprocessor Cache Analysis Using ATUM, Proceedings of the 15th ISCA, May 1988, pp. 186-195.
[Smit82] Smith A. J., Cache Memories, Computing Surveys, 14, 3, September 1982, pp. 473-530.
[Smit87] Smith A. J., CPU Cache Consistency with Software Support and Using One-Time Identifiers, Proceedings of the Pacific Computer Communications Symposium, Seoul, October 1987, pp. 153-161.
[Sten89] Stenstrom P., A Cache Coherence Protocol for Multiprocessors with Multistage Networks, Proceedings of the 16th ISCA, May 1989, pp. 407-415.
[Tart92] Tartalja I., Milutinovic V., An Approach to Dynamic Software Cache Consistency Maintenance Based on Conditional Invalidation, Proceedings of the 25th HICSS, January 1992, pp. 457-466.
[Thap90] Thapar M., Delagi B., Cache Coherence for Shared Memory Multiprocessors, Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, July 1990, pp. 155-160.
[Vern86] Vernon M., Holliday M., Performance Analysis of Multiprocessor Cache Consistency Protocols Using Generalized Timed Petri Nets, Proceedings of the Performance 86 and ACM Sigmetrics 1986, May 1986, pp. 9-17.
[Vern88] Vernon M., Lazowska E., Zahorjan J., An Accurate and Efficient Performance Analysis Technique for Multiprocessor Snooping Cache-Consistency Protocols, Proceedings of the 15th ISCA, May 1988.
[Wood89] Woodbury P., Wilson A., Shein B., Gertner I., Chen P. Y., Bartlett J., Aral Z., Shared Memory Multiprocessors: The Right Approach to Parallel Processing, Proceedings of the Spring COMPCON 89, 1989, pp. 72-80.
[Yeh83] Yeh C., Patel J., Davidson E., Shared Cache for Multiple-Stream Computer Systems, IEEE Transactions on Computers, C-32(1), January 1983, pp. 38-47.


~ $ 7




A Survey of Hardware Solutions for Maintenance of Cache Coherence in Shared Memory Multiprocessors, M. Tomasevic and V. Milutinovic
