

Milo Tomašević
Department of Computer Engineering, Pupin Institute, POB 15, 11000 Belgrade, Yugoslavia
email: etomasev@yubgefi1.bitnet

Veljko Milutinović
School of Electrical Engineering, University of Belgrade, POB 816, 11000 Belgrade, Yugoslavia
email: milutino@pegasus.ch

An appropriate solution to the well-known cache coherence problem in shared memory multiprocessors is one of the key issues in improving performance and scalability of these systems. Hardware methods are highly convenient because of their transparency for software. They also offer good performance since they deal with the problem fully dynamically. A great variety of schemes has been proposed; not many of them were implemented. Two major groups of hardware protocols can be recognized: directory and snoopy. This survey underlines their principles and summarizes a relatively large number of relevant representatives from both groups. The coherence problem in multilevel caches is also briefly considered. Special attention is devoted to cache coherence maintenance in large scalable shared memory multiprocessors.

This research was partially sponsored by the NCR Corporation, Augsburg, Germany, and the NSF of Serbia, Belgrade, Yugoslavia. The ENDOT simulation tools were donated by the ZYCAD Corporation, Menlo Park, California, USA, and the TDT Corporation, Cleveland Heights, Ohio, USA. The MOSIS compatible VLSI design tools were provided by the TANNER Corporation, Pasadena, California, USA. The VLSI design tools were provided by the ALTERA Corporation, Santa Clara, California, USA.

1. Introduction

Multiprocessors are the most appropriate type of computer systems to meet the ever-increasing needs for more computing power. Among them, shared memory multiprocessors represent an especially popular and efficient class. Their increasing use is due to some significant advantages that they offer. The most important advantage is the simplest and the most general programming model, which allows easier development of parallel software, and supports efficient sharing of code and data.

However, nothing comes for free. Shared memory multiprocessors suffer from potential problems in achieving high performance, because of the inherent contention in accessing shared resources, which is responsible for longer latencies in accessing shared memory. Introduction of private caches attached to the processors helps greatly in reducing average latencies (Figure 1). Caches in multiprocessors are even more useful than in uniprocessors, since they also increase effective memory and communication bandwidth.

However, this solution imposes another serious problem. Multiple copies of the same data block can exist in different caches, and if processors are allowed to update freely their own copies, an inconsistent view of the memory is imminent, leading to program malfunction. This is the essence of the well-known cache coherence problem [Dubo88]. The reasons for coherence violation are: sharing of writable data, process migration, and I/O activities. A system of caches is said to be coherent if every read by any processor finds a value produced by the last previous write, no matter which processor performed it. Because of that, the system must incorporate some cache coherence maintenance mechanism which consists of a complete and consistent set of operations (for accessing shared memory), preserves a coherent view of the memory, and ensures program execution with correct data.

The cache coherence problem has attracted considerable attention through the last decade. A lot of research at prominent universities and companies has been devoted to that problem, resulting in a number of proposed solutions. The importance of the problem is emphasized by the fact that a cache coherence solution is not only necessary for correct program execution, but can also have a very significant impact on system performance, so it is of utmost importance to employ cache coherence solutions as efficiently as possible. It has been shown that the efficiency of a cache coherence solution depends to a great extent on system architecture parameters and especially on parallel program characteristics [Egge89]. The choice of an appropriate solution is even more critical in large multiprocessors, where inefficiencies in coherence maintenance are multiplied, which can seriously hurt the scalability.
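The coherence definition given in the introduction (every read finds the value produced by the last previous write) can be illustrated with a minimal sketch. The toy model below is an assumption made for illustration only: `PrivateCache` and the address `0x10` are hypothetical names, and the caches deliberately implement no coherence mechanism at all, so the second read by cache B returns a stale value.

```python
# Hypothetical toy model: two private caches in front of one shared memory,
# with NO coherence mechanism. All names are illustrative assumptions.

class PrivateCache:
    def __init__(self, memory):
        self.memory = memory      # shared dict: address -> value
        self.lines = {}           # locally cached copies

    def read(self, addr):
        if addr not in self.lines:              # miss: fetch from shared memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                 # hit: return the local copy

    def write(self, addr, value):
        self.lines[addr] = value                # update own copy
        self.memory[addr] = value               # write through to memory


memory = {0x10: 1}
cache_a, cache_b = PrivateCache(memory), PrivateCache(memory)

first_read_b = cache_b.read(0x10)   # B caches the initial value
cache_a.write(0x10, 2)              # A performs the last write
stale_read_b = cache_b.read(0x10)   # B hits on its old copy: coherence is violated
```

Even though memory is kept up to date by the write-through, cache B never learns about the write. Closing that gap is the job of the coherence mechanisms surveyed in the rest of the paper.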

Figure 1: A shared memory multiprocessor system with private caches (legend: memory module, private cache, processor, interconnection network)

Proceedings of the Hawaii International Conference on System Sciences, Koloa, Hawaii, USA, January 5-8, 1993.


1060-3425/93 $03.00 © 1993 IEEE

2. Cache coherence solutions

Basically, all solutions to the problem can be classified into two large groups: software-based and hardware-based [Sten90]. This is a traditional classification that still holds, since the solutions proposed so far follow fully or predominantly one of the two approaches. Meanwhile, solutions using a combination of hardware and software means become more frequent and promising, so it seems that this classification will be of less use in the future.

2.1. Software solutions

Software-based solutions generally rely on the actions of the programmer, compiler, or operating system, in dealing with the coherence problem. The simplest but the most restrictive method is to declare pages of shared data noncacheable. More advanced methods allow the caching of shared data and accessing them only in critical sections, in a mutually exclusive way. Special cache managing instructions are often used for cache bypass, flush, and indiscriminate or selective invalidation, in order to maintain coherence. Decisions about coherence related actions are often made statically during the compiler analysis (which tries to detect conditions for coherence violation). There are also some dynamic methods based on the operating system actions.

Software schemes are generally less expensive than their hardware counterparts, although they may require considerable hardware support. It is also claimed that they are more convenient for large, scalable multiprocessors. On the other side, some disadvantages are evident, especially in static schemes, where inevitable inefficiencies are incurred since the compiler analysis is unable to predict the flow of program execution accurately and conservative assumptions have to be made. Software-based solutions will not be elaborated upon any further in this paper, because of space restrictions.

2.2. Hardware solutions

We have focused our attention on the hardware-based solutions, usually called cache coherence protocols. Although they require an increased hardware complexity, their cost is well justified by significant advantages of the hardware-based approach:

• Hardware schemes deal with the coherence problem by dynamic recognition of inconsistency conditions for shared data entirely at run-time. They promise better performance, especially for higher levels of data sharing, since the coherence overhead is generated only when actual sharing of data takes place.
• Being totally transparent to software, hardware protocols free the programmer and compiler from any responsibility about coherence maintenance, and impose no restrictions to any layer of software.
• Various proposed hardware schemes efficiently support the full range from small to large scale multiprocessors.
• Technology advances made their cost quite acceptable, compared to the system costs.

Due to the aforementioned reasons, hardware cache coherence protocols are much more investigated in the literature, and also much more frequently implemented in commercial multiprocessor systems.

Essential characteristics of existing hardware schemes are reflected in respect to the following criteria:

• Where the status information about data blocks is held and how it is organized.
• Who is responsible for preserving the coherence in the system.
• What kind of coherence actions are applied.
• Which write strategy is accepted.
• Which type of notification is used.

According to the most widely accepted classification based on the first and the second criterion, hardware cache coherence schemes can be principally divided into two large groups: directory and snoopy protocols. In the next two sections, the basic principles of both groups are presented, together with numerous solutions belonging to them. After that, cache coherence in multilevel hierarchies is discussed. Finally, employing of the previous methods in large scalable multiprocessors with cache-coherent architectures is considered.

This survey tries to be as broad as possible; however, it is not exhaustive. It follows one specific organizational structure. Each approach to be considered here is briefly introduced. Then, different examples are presented. Whenever appropriate, the following points of view will be underlined:

• Research environment
• Essence of the approach
• Selected details of the approach
• Advantages
• Disadvantages
• Special comments (performance, etc.)

3. Directory protocols

The main characteristic that distinguishes this group of schemes is that the global system-wide status information relevant for coherence maintenance is stored in some kind of directory. The responsibility for coherence preserving is predominantly delegated to a centralized controller which is usually a part of the main memory controller. Upon the individual requests of the local cache controllers, the centralized controller checks the directory, and issues necessary commands for data transfer between memory and caches, or between caches themselves. It is also responsible for keeping status information up-to-date, so every local action which can affect the global state of the block must be reported to the central controller. Besides the global directory maintained by the central controller, private caches store some local state information about cached blocks. Directory protocols are primarily suitable for multiprocessors with general interconnection networks.

The global directory can be organized in several ways. According to that issue, directory methods can be divided into three groups: full-map directory, limited directory, and chained directory schemes. Paper [Agar89] introduces one useful classification of directory schemes denoting them as Dir_i X, where i is the number of pointers, and X is B or NB, for broadcast and no-broadcast schemes, respectively.

3.1. Full-map directory schemes

The main characteristic of these schemes is that the directory is stored in the main memory, and contains entries for each memory block. An entry points to exact locations of every cached copy of the memory block, and keeps its status. Using this information, coherence of data in private caches is maintained by directed messages to known locations, avoiding usually expensive broadcasts.
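The idea of maintaining coherence through directed messages to known locations can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not any of the surveyed protocols: the class `FullMapDirectory`, its method names, and the message tuples are hypothetical, data transfer is omitted, and only the bookkeeping of sharers and the resulting point-to-point commands is modelled.

```python
class FullMapDirectory:
    """Toy central directory with one entry per memory block (illustrative only)."""

    def __init__(self):
        self.sharers = {}     # block -> set of cache ids known to hold a copy
        self.modified = {}    # block -> True if a single cache holds a modified copy
        self.messages = []    # log of directed (point-to-point) commands

    def read_miss(self, block, cache_id):
        holders = self.sharers.setdefault(block, set())
        if self.modified.get(block):
            # Exactly one holder exists here; ask it to write the block back first.
            (owner,) = holders
            self.messages.append(("write_back", owner, block))
            self.modified[block] = False
        holders.add(cache_id)

    def write_request(self, block, cache_id):
        holders = self.sharers.setdefault(block, set())
        # Only caches recorded in the directory receive an invalidation: no broadcast.
        for other in sorted(holders - {cache_id}):
            self.messages.append(("invalidate", other, block))
        self.sharers[block] = {cache_id}
        self.modified[block] = True


directory = FullMapDirectory()
directory.read_miss("X", 0)
directory.read_miss("X", 2)
directory.write_request("X", 1)    # invalidates caches 0 and 2 only
directory.read_miss("X", 3)        # forces a write-back from cache 1
```

In a system with many caches, the write by cache 1 generates two messages rather than a broadcast to every cache, which is the property the text attributes to directory schemes.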

The first protocol from this class was developed in [Tang76] and later implemented in the IBM 3081. The memory directory is organized as a set of copies of all individual cache directories. Such an organization implies some serious disadvantages. Directory search is not easy, because duplicates of all cache directories must be checked when determining the status of a particular memory block. Also, information in the memory directory and the cache directories has to be consistent all the time.

The classical full-map directory scheme proposed in [Cens78] applies the same coherence policy, but provides a much more efficient directory search. The directory contains entries for each memory block in the form of a bit vector (Dir_N NB scheme). A directory entry consists of N+1 bits: one presence bit per each of N processor-cache pairs, and one bit which denotes whether the block is modified in one of the caches (Figure 2a). It uses an invalidation approach and allows the existence of multiple unmodified cached copies of the same block in the system, but only one modified copy. Local cache directories store two bits per cache block (valid and modified bits).

A very similar directory scheme is described in [Yen85]. The memory directory is identically organized, but cache entries have an additional bit per block, which is set in the case of a clean exclusive copy. This eliminates the need for directory access for write hits to unmodified private blocks, but implies the burden of maintaining the correct value of this bit.

The main advantage of the full-map approach is that locating necessary cached copies is easy, and only caches with valid copies are involved in coherence actions for a particular block. Because of that, these schemes deliver the best performance of all directory schemes. There are a couple of disadvantages, too. Since all requests are directed to the central directory, it can become a performance bottleneck. Also, these schemes are not scalable for several reasons. The centralized controller is inflexible for system expansion by adding new processors. The most serious problem is the significant memory overhead for full-map directory storage in multiprocessor systems with a large number of processors. One approach to alleviate this problem is presented in [OKra90]. The proposed sectored scheme reduces directory size by increasing the block size, while choosing the subblock as the coherency unit.

Figure 2: The examples of three directory organizations: a) full-map directory organization b) limited directory organization c) chained directory organization (legend: shared memory, cache memory, data block, pointer, valid bit, dirty bit, chain terminator, number of processors)

3.2. Limited directory schemes

Motivation to cope with the problem of memory overhead in full-map directory schemes led to centralized schemes with partial maps or limited directories. They replace the presence bit vector with a small number of identifiers pointing to cached copies (Figure 2b). The condition for storage efficiency of limited directory schemes over full-map directory schemes is given by i log2 N < N, where i is the number of pointers, and N is the number of processors. The size difference between a full and a limited directory is significant for small i and large N. This concept is justified by findings of some studies [Webe89] that the number of simultaneously existing cached copies of the same block is usually small.

Entries in limited directories contain a fixed number of pointers. Special actions have to be taken when the number of cached copies exceeds the number of pointers. If the protocol disallows broadcasts, one copy has to be invalidated, to free the pointer for a new cached copy (Dir_i NB schemes). These protocols put an upper limit on the number of simultaneously cached copies of a particular block. The schemes with broadcast capability allow that situation, because they can invalidate all copies using a broadcast signal when necessary. Limited directory schemes are described in [Agar89].

One example of an extremely storage efficient method is the "two-bit scheme" [Arch85]. Instead of using an array of pointers, only two bits are used to encode the state of each block. Since exact identities of caches containing the copies of the block are not known, the scheme has to broadcast the coherence command when necessary. The protocol is easily expandable without further modifications; however, performance degradation for higher levels of sharing and a greater number of processors can be intolerable.
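The storage figures quoted for directory entries (N+1 bits for a full-map entry, and the condition i log2 N < N for limited directories) can be checked with a short calculation. The sketch below is only an illustration: the function names are hypothetical, pointers are assumed to occupy ceil(log2 N) bits each, and any per-entry status bits of a limited directory are ignored.

```python
import math

def full_map_entry_bits(n):
    """N presence bits plus one modified bit, as described for the full-map scheme."""
    return n + 1

def limited_pointer_bits(i, n):
    """Assumption: each of the i pointers needs ceil(log2 N) bits to name a cache."""
    return i * math.ceil(math.log2(n))

def limited_is_more_compact(i, n):
    """The storage efficiency condition from the text: i * log2(N) < N."""
    return i * math.log2(n) < n

# A 64-processor system with 4 pointers per entry: 24 pointer bits versus 65 bits.
bits_full_64 = full_map_entry_bits(64)
bits_limited_64 = limited_pointer_bits(4, 64)

# In a small 8-processor system, 4 pointers already cost more than the bit vector.
small_system_ok = limited_is_more_compact(4, 8)
large_system_ok = limited_is_more_compact(4, 64)
```

The comparison makes the remark about "small i and large N" concrete: the advantage of a limited directory grows with the number of processors and vanishes for small systems.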

The scalability of limited directory protocols, in terms of memory overhead for directory storage, is quite good. However, their performance heavily depends on sharing characteristics of parallel applications.

A solution which combines the partial and full map approaches is given in [OKra90]. Tags of two different sizes are used. A small tag contains a limited number of pointers, while a large tag consists of the full-map bit vector. Tags are stored in two associative caches and allocated when needed. On the first reference, a small tag is allocated to the block, but when the small tag becomes insufficient, it is freed, and a large tag is allocated. When no tags are free, one of the used tags is selected, the corresponding block is invalidated, and the tag is reallocated.

Another very simple scheme, the SCCDC protocol, prohibits more than one copy of the block [Wend89]. Every read and write miss takes over the copy and makes it private. The coherence protocol can also be bypassed for private data. This mechanism is cost-effective for applications with prevalent sharing in the form of migratory data and synchronization variables.

3.3. Chained directory schemes

Another way to ensure scalability of directory schemes, with respect to tag storage efficiency, is the introduction of chained, distributed directories. It is important that the approach does not limit the number of cached copies. Unlike the two previous directory approaches, chained directories are spread across the individual caches. The entry in main memory is used only to point to the head of the list and to keep the block status. Entries of such a directory are organized in the form of linked lists, where all caches sharing the same block are chained through pointers into one list (Figure 2c). Requests for the block are issued to the memory, and subsequent commands from the memory controller are usually forwarded through the list, using the pointers. Distributed, chained directories can be organized in the form of either singly or doubly linked lists.

Distributed directories implemented as singly linked lists are used in the Stanford Distributed-Directory Protocol [Thap90]. Main memory contains the pointer to the head of the list for a particular block, and the last member of the list contains the chain terminator. On a read miss, the requester is put on the head of the list and obtains the data from the previous head. On a write request, the invalidation signal is forwarded from the head through the intermediate members to the tail. Completion of actions is acknowledged with reply messages. Replacement of a chained copy in a cache is handled by invalidation of the lower part of the list. Some optimizations are incorporated into the coherence mechanism. One of the most important is the possibility of combining the requests for list insertions.

A distributed directory with doubly linked lists is proposed in the SCI (Scalable Coherent Interface) project [Jame90]. Each list entry has a forward and a backward pointer. An entry is added to the head of the list, by updating the pointers for both the new entry and the old head entry. In this way, the replacement problem is alleviated because an entry can easily be dropped by chaining its predecessor and its successor. Because of better handling of the replacement situation, doubly linked lists perform slightly better compared to singly linked lists, at the expense of being more complex and using twice as much storage for pointers.

The main advantage of chained directory schemes is their scalability, while performance is almost as good as in full-map directory schemes [Chai90].

4. Snoopy protocols

In this group of hardware solutions, the centralized controller and the global state information are not employed. Coherence maintenance is based only on the actions of local cache controllers and distributed local state information. Because of that, all the actions for the currently shared block must be announced to all other caches, via broadcast capability. Local cache controllers are able to snoop on the network, and to recognize the actions and conditions for coherence violation, which imposes some reactions (according to the utilized protocol), in order to preserve coherence.

Snoopy protocols are ideally suited for multiprocessors which use a shared bus as the global interconnect, since the shared bus provides very inexpensive and fast broadcasts. They are also known as very cost-effective and flexible schemes. However, coherence actions on the shared bus additionally increase the bus traffic, and make the bus saturation more acute. Consequently, only systems with a small to medium number of processors can be supported by snoopy protocols.

Two write policies are usually applied in snoopy protocols: write-invalidate and write-update (or write-broadcast).

4.1. Write-invalidate snoopy protocols

Write-invalidate protocols allow multiple readers, but only one writer at a time. Every write to a shared block must be preceded with the invalidation of all other copies of the same block, preventing the use of stale copies (Figure 3a). Once the block is made exclusive, cheap local writes can proceed until some other processor requires the same block.

The approach originated from the WTI protocol, where a write-through to memory on the system bus is followed by invalidation of all copies of the involved block. This simple mechanism can be found in some earlier machines. Such an unsophisticated coherence solution results in poor system performance because of a very high bus utilization [Arch86]. Only systems with a very small number of processors can use this scheme.

The Write-Once protocol was one of the first write-back snoopy schemes. It was intended for single-board computers in the Multibus environment [Good83]. The action on the first write to a shared block is the same as in the WTI protocol, leaving the block in the specific "reserved" state. Subsequent write hits to that copy (in the "dirty" state) proceed locally, until some other processor requests the block. Another improvement refers to the ability that read misses can be serviced by the cache which holds the "dirty" block.

An ownership based protocol is developed for the Synapse N+1, a fault tolerant multiprocessor system [Fran84]. A bit for each memory block determines whether memory or a cache is the owner of the block. Cache ownership ensures cheap local accesses, and prevents race conditions. However, a read miss is inefficiently handled when a cache is the block owner, because memory has to be updated first, and the read request resubmitted then. Another disadvantage is that a write hit to a shared block produces the same action as a write miss, since there is no invalidation signal.
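The Write-Once behaviour described in this section (first write goes through to memory and leaves the block "reserved", later writes proceed locally in the "dirty" state) can be expressed as a small state table. This is a hedged sketch rather than the protocol as specified in [Good83]: the state names "invalid" and "valid", the bus transaction names, and the helper functions are assumptions introduced for illustration, and data values are not modelled.

```python
# (current state, processor operation) -> (next state, bus transaction or None)
LOCAL = {
    ("invalid", "read"):   ("valid", "bus_read"),
    ("invalid", "write"):  ("dirty", "bus_read_invalidate"),
    ("valid", "read"):     ("valid", None),
    ("valid", "write"):    ("reserved", "bus_write_through"),  # the "write once"
    ("reserved", "read"):  ("reserved", None),
    ("reserved", "write"): ("dirty", None),                    # purely local
    ("dirty", "read"):     ("dirty", None),
    ("dirty", "write"):    ("dirty", None),
}

def snoop(state, bus_op):
    """Reaction of a cache that observes another cache's bus transaction."""
    if state == "invalid":
        return "invalid"
    if bus_op == "bus_read":
        return "valid"      # a dirty or reserved copy becomes an ordinary shared copy
    return "invalid"        # write-through or read-invalidate kills other copies

def access(states, cpu, op):
    """Apply one access to a single block; return the bus transaction (or None)."""
    states[cpu], bus_op = LOCAL[(states[cpu], op)]
    if bus_op is not None:
        for other in range(len(states)):
            if other != cpu:
                states[other] = snoop(states[other], bus_op)
    return bus_op

states = ["invalid"] * 3
bus_log = [access(states, 0, "read"), access(states, 1, "read"),
           access(states, 0, "write"), access(states, 0, "write")]
after_writes = list(states)
last_bus_op = access(states, 2, "read")   # the dirty copy in cache 0 is shared again
```

The log shows the point of the scheme: only the first of the two consecutive writes by cache 0 appears on the bus.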

Figure 3: Write strategy in snoopy protocols: a) invalidation policy b) update policy (legend: shared memory, cache memory, data block, updated data block, valid bit, invalidation signal, distributed write)

The Berkeley protocol, implemented in the SPUR multiprocessor, also applies the ownership principle, but improves significantly over the previously described protocol [Katz85]. Efficient cache-to-cache transfers without memory update are enabled by the existence of the new "shared dirty" state. The main disadvantage of the protocol is its inability to recognize a possible exclusive use when fetching a missed block. A more sophisticated version is also proposed, which uses software hints to distinguish between loads of shared and non-shared data.

The problem of recognizing the sharing status on block load is solved in the Illinois protocol [Papa84]. To this end, the new "exclusive unmodified" state is introduced. Private data are better handled, since the protocol entirely avoids invalidations on write hits to unmodified non-shared blocks. On a read miss, every cache with a valid copy tries to respond, and to supply data. This requires correct arbitration on the shared bus. If the block is modified, a memory update takes place simultaneously with the cache-to-cache transfer.

The CMU RB protocol presented in [Rudo84] tries to solve the main obstacle for performance improvement in write-invalidate protocols, invalidation misses, by introducing block validation. Only three states are used to specify the block status. When some cache issues a bus read, all caches with invalidated blocks catch the data available on the bus, and update their own copies. This read broadcast cuts down the number of invalidation misses to one per invalidated block. In spite of the general usefulness of the read broadcast action, there is a negative side effect, which is the increase of cache interference reflected in high processor lockout from the bus [Egge89a]. Blocks with one word length are assumed, so space locality in programs can not be exploited in a proper way.

Many of the previously mentioned useful features are included in the EIP (Efficient Invalidation Protocol) [Arch87]. Besides dirty owners, this protocol proposes another new feature: clean cache ownership. When no dirty block owner exists, a clean block owner is defined as the last cache that experienced a read miss for the block, unless the block is replaced later. On a read miss, only one cache responds. This makes handling of read and write misses to most unmodified blocks more efficient compared to protocols in which only memory can be the clean owner.
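The effect of read broadcast (block validation) on the number of invalidation misses can be illustrated with a small counting experiment. The sketch below is a simplified model made up for illustration, not the RB protocol itself: the function `count_read_misses`, the state names, and the access trace are hypothetical, a single block is tracked, and only read misses are counted.

```python
def count_read_misses(trace, read_broadcast):
    """Count read misses on one block. trace: list of (cache_id, 'r' or 'w')."""
    INVALIDATED, VALID = "invalidated", "valid"
    state = {}        # cache_id -> state; caches that never held the block are absent
    misses = 0
    for cpu, op in trace:
        if op == "w":
            for other in state:              # write-invalidate: kill all other copies
                if other != cpu:
                    state[other] = INVALIDATED
            state[cpu] = VALID
        elif state.get(cpu) != VALID:        # read miss: block is fetched over the bus
            misses += 1
            state[cpu] = VALID
            if read_broadcast:
                # Every cache holding an invalidated copy catches the data on the bus.
                for other, s in state.items():
                    if s == INVALIDATED:
                        state[other] = VALID
    return misses

# Three caches read the block, cache 0 writes it, then caches 1 and 2 read again.
trace = [(0, "r"), (1, "r"), (2, "r"), (0, "w"), (1, "r"), (2, "r")]
misses_plain = count_read_misses(trace, read_broadcast=False)
misses_with_rb = count_read_misses(trace, read_broadcast=True)
```

After the write, the plain scheme pays one miss per invalidated copy, while with read broadcast the first re-read revalidates the other copy as well, which matches the "one per invalidated block" remark above.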

The EIP protocol also applies the concept of block validation as in the RB protocol. Implementation of the EIP requires an increased hardware complexity. An additional bus line for determination of the existence of dirty owners is needed, and an additional bus line is also required for transferring of the dirty ownership.

Instead of the usual full block invalidation, the WIP (Word Invalidation Protocol) employs a partial word invalidation [Toma92]. It allows the existence of invalid words within the partially valid block, until the pollution point is reached. After the invalidation threshold is reached, the whole block is invalidated. This "fine-grain" invalidation approach avoids some unnecessary invalidations and achieves better data utilization. A read miss to a partially valid block is serviced by a shorter bus read word operation, which gradually recovers the block. A write miss on the only invalid word in a partially valid block is also optimized.

4.2. Write-update protocols

Write-update schemes follow a distributed write approach which allows the existence of multiple copies with write permission. The word to be written to a shared block is broadcast to all caches, and caches containing that block can update it (Figure 3b). Write-update protocols usually employ a special bus line for dynamic detection of the sharing status for a cache block. This line is activated on the occasion of a distributed write, whenever more than one cached copy of the block exists. When sharing of a block ceases, that block is marked as private and the distributed writes are no longer necessary. In this way, write-through is used for shared data, and write-back for private data.

A typical protocol from this group is the Firefly [Thac88], which is implemented in DEC's Firefly multiprocessor workstation. Only three states are sufficient to describe the block status, since the invalid state is not required. No clean owner exists and all caches with shared copies respond to a read miss, supplying the same data in a synchronized manner. The memory copy is also updated on word broadcast.

A very similar coherence maintenance is applied in the Dragon protocol [McCr84], which is developed for the Dragon multiprocessor workstation from Xerox PARC. The main difference with the previous protocol is in the memory update policy. Dragon does not update memory on a distributed write, resulting in the need for a clean owner (in respect to other cached copies) described by an additional state. The owner is the last cache that performed a write to a block. The only situation which requires a memory update is the eviction of the owned block from the cache. Avoiding frequent memory updates on shared write hits is the source of performance improvement over the Firefly protocol, especially for higher rates of shared references [Arch86].

Performance studies show that the write-invalidate protocols are a good choice for applications characterized by a sequential pattern of sharing, while fine-grain sharing can hurt their performance a lot [Egge89]. On the contrary to write-invalidate protocols, the write-update group is better suited for applications with tighter sharing. However, serious performance degradation of write-update protocols can be caused by process migration. Allowing the processes to be switched among processors produces false sharing when logically private data become physically shared. In that situation, actual sharing increases and unnecessary coherency overhead is induced. In case of sequential sharing and process migration, performance can be seriously hurt with frequent unnecessary write broadcasts.

A Dragon-like protocol is proposed in [Pret90] which tries to cope with this problem. It introduces one bit with each block to discriminate copies referenced by the current and the previous process, called used and unused copies. Comparing this identifier with the running process identifier, unused copies can be detected and eliminated. In this way, some unnecessary write broadcasts can be avoided and effects of cache interference reduced.

4.3. Adaptive protocols

Evidently, neither of the two approaches is able to deliver superior performance across all types of workloads. This is the reason why some protocols have been proposed that combine invalidation and update policy in a suitable way. They start with write broadcasts, but when a longer sequence of local writes is encountered or predicted, an invalidation signal is sent onto the system bus, making the own copy exclusive. These solutions can be regarded as adaptive, since they attempt to adapt the coherence mechanisms to the observed and predicted data use, in the effort to achieve the optimal performance.

One of the first examples of combined policy can be found in the RWB protocol [Rudo84], which is an enhancement of the former RB protocol. The first write to a shared block is a write-through, which updates other cached copies and memory. On the second successive write to a block by the same processor, the invalidation signal for the block is sent. It seems that the invalidation threshold of this protocol is too low.

A similar principle is followed in the EDWP protocol [Arch88]. A fixed criterion is set for switching from update policy to invalidate policy. After three distributed writes by a single processor, uninterrupted with a reference by any other processor, all remote cached copies will be invalidated. This criterion seems to be a reliable indicator of local block usage. Block status is defined with eight possible states. Like the EIP, this protocol also allows a cache to obtain clean ownership of the block, improving the efficiency of miss service.

The combination of invalidate and update policy, according to the read/write access pattern, is the essence of the competitive snooping approach [Karl86]. Coherence maintenance starts with write-updates until their cost in cycles spent reaches the total cost of invalidation misses for all processors possessing the block, and then switches to write-invalidates. Coherence overhead stays within a factor of two, compared to the cost of the optimal off-line algorithm. A couple of variants of the approach are proposed, as well as their implementation details. Whether this approach brings a performance improvement or not is highly dependent on the sharing pattern of the particular parallel application.

Since a wide diversity of coherence protocols exists in this area, a strong need for a standardized cache coherence protocol has appeared. The MOESI class of compatible coherence protocols is introduced in [Swea86]. The MOESI model encompasses five possible block states, according to the validity, exclusiveness, and ownership statuses. Most of the existing protocols fit in this class, either as originally proposed or with some adaptations. This class is very flexible, because each cache in the system may employ a different coherence and write policy from the class at the same time.
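The statement that the five MOESI states follow from validity, exclusiveness, and ownership can be made concrete with a small enumeration. The sketch below is only an illustration: the state names Modified, Owned, Exclusive, Shared, and Invalid are the commonly used expansion of the acronym and are not spelled out in the text above, and the helper properties are hypothetical.

```python
from enum import Enum

class MOESI(Enum):
    #            (valid, exclusive, owned)
    MODIFIED  = (True,  True,  True)
    OWNED     = (True,  False, True)
    EXCLUSIVE = (True,  True,  False)
    SHARED    = (True,  False, False)
    INVALID   = (False, False, False)

    @property
    def valid(self):
        return self.value[0]

    @property
    def exclusive(self):
        return self.value[1]

    @property
    def owned(self):
        return self.value[2]

# States whose holder is responsible for the block (e.g. must write it back on eviction).
owner_states = [s.name for s in MOESI if s.owned]

# States in which no other cache holds a copy, so a write needs no bus transaction.
silent_write_states = [s.name for s in MOESI if s.valid and s.exclusive]
```

Three binary attributes would allow eight combinations; only five are meaningful because an invalid copy can be neither exclusive nor owned.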

and write policy from the class at the same time. Consistent memory view is still guaranteed. The IEEE Futurebus standard bus provides all necessary signals for the implementation of this model.

4.4. Lock-based protocols

Synchronization and mutual exclusion in accessing shared data is a topic very closely related to coherence. Most of the snoopy coherence schemes enforce a strict consistency model and do not address these issues explicitly. The common way to achieve exclusive access during a write to shared data is using the system bus as a semaphore. Meanwhile, some approaches find it useful to directly support synchronization primitives along with the mechanism for preserving cache coherence. We will mention here two of them.

The protocol proposed in [Bita86] is a typical example of a lock-based protocol, which combines coherence strategy with synchronization. It supports efficient busy-wait locking, waiting and unlocking, without use of the test-and-set operation. Two additional states dedicated to lock handling are introduced: one for lock possession and the other for lock waiting. When unlock is broadcast on the system bus, the busy-wait register of the waiter can detect it upon match; it obtains the lock after prioritized bus arbitration, and interrupts the processor to use the locked data. This mechanism allows the lock-waiter to work while waiting. In this way, unsuccessful retries for lock acquiring are completely removed from the bus, which is a critical system resource. The main disadvantage of this scheme is that it does not differentiate between read lock and write lock requests.

Lock primitives are also supported in the snoopy protocol described in [Lee90]. Waiting for a lock is organized through distributed, hardware implemented FIFO queues. Caches which try to obtain a lock are chained in a waiting list. The tag field of each cache entry, besides state, contains a pointer to the next waiting cache and the count of waiters in the peer group. The advantage over previous protocols is that shared and exclusive locks can be distinguished. Shared locks allow simultaneous access for read lock requesters, which improves performance. This protocol has 13 possible block states, and incurs a significant additional hardware complexity.

5. Coherence in multilevel caches

Having in mind the growing disparity between processor and memory speed, cache memories become more and more important. Meanwhile, single level caches are unable to successfully fulfil the two usual requirements: to be fast and large enough. Multilevel cache hierarchy seems to be the unavoidable solution to the problem. Lower levels of such a hierarchy are smaller but faster caches. Their task is to reduce miss latency. Upper level caches are slower but much larger, in order to attain higher hit ratio and reduce the traffic on the interconnection network.

Since coherence maintenance is aggravated in multilevel caches, it is necessary to follow the principle of inclusion, to make it more efficient. This principle implies that the upper level cache is a superset of all caches in the hierarchy below it. In this way, coherence actions towards lower levels are filtered, and reduced only to the really necessary ones, lowering the cache interference. This gain justifies some space inefficiency incurred by inclusion.

Among many successful organizations of multilevel caches, three of them appear to be the most important [Baer88]. The first organization is an extension of the single level cache, where every processor has its private hierarchy of caches (Figure 4a). Besides private first level caches, the other two organizations introduce shared caches on upper levels, common for several processors, but with a different access structure. In the second scheme, the upper level cache is multiported between the lower level caches (Figure 4b), while in the third organization, the upper level cache is accessed through a shared bus from the lower level caches (Figure 4c). Applying inclusion to these types of cache hierarchies shows that the conditions for simple coherence maintenance can easily be met in the first type of organization, and that the third organization can also be an attractive solution, because of its cost-effectiveness. Specific extensions of existing protocols are needed to ensure coherence in multilevel caches.

Figure 4: Multilevel cache organizations: a) private b) multiport shared c) bus-based shared. Legend: processor (P), first level cache, second level cache, cluster bus (CB), system bus (SB), memory module.

A specific two-level cache hierarchy is proposed in [Wang89]. Every processor has a virtually addressed first level cache, and a physically addressed second level cache, with write-back on both levels. Virtual addressing on the first level makes it faster because of avoiding address translation in the case of hit, although new problems like handling of synonyms are introduced.
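The filtering effect of the inclusion principle can be sketched for a private two-level hierarchy. This is an illustrative model only: caches are represented as sets of block addresses, and the class name `TwoLevelCache` and its methods are hypothetical. It assumes that the second (upper) level snoops the bus and forwards an invalidation to the first (lower) level only when it holds the block itself.

```python
class TwoLevelCache:
    """Private two-level hierarchy that maintains inclusion (L1 subset of L2)."""

    def __init__(self):
        self.l1 = set()        # first (lower) level: small and fast
        self.l2 = set()        # second (upper) level: large, snoops the bus
        self.l1_probes = 0     # coherence actions that reached the first level

    def fill(self, block):
        # A block brought into L1 is also kept in L2.
        self.l2.add(block)
        self.l1.add(block)

    def evict_l1(self, block):
        # L1 replacement does not affect inclusion.
        self.l1.discard(block)

    def evict_l2(self, block):
        # L2 replacement must also remove the block from L1 to keep inclusion.
        self.l2.discard(block)
        self.l1.discard(block)

    def snoop_invalidate(self, block):
        """Handle an invalidation seen on the bus; return True if L1 was probed."""
        if block not in self.l2:
            # By inclusion the block cannot be in L1: the action is filtered.
            return False
        self.l2.discard(block)
        self.l1_probes += 1
        self.l1.discard(block)
        return True

    def inclusion_holds(self):
        return self.l1 <= self.l2
```

Invalidations for blocks that are absent from the second level never disturb the first level, which is the interference reduction that justifies the space cost of inclusion.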

In the hierarchy of [Wang89], the synonym problem is solved by maintaining pointers between copies in the first and the second level caches, as a part of an extended invalidation protocol for preserving coherence. Inclusion property is also applied. In order to increase memory bandwidth, splitting the first level cache into the instruction and the data part is advocated.

6. Cache coherence in large shared memory multiprocessors

One of the most required characteristics of hardware cache coherence solutions is scalability, which denotes the ability to efficiently support large-scale shared memory multiprocessor systems. In striving for more processing power, a broad range of various architectures has been proposed for those systems. Some of them try to retain the advantages of common bus systems and overcome scalability limitations by introducing multiple-bus architectures, while other systems are oriented towards general interconnection networks and directory-based schemes. A frequent solution is to organize systems into bus-based clusters or subsystems which are connected with some kind of network. Consequently, the schemes that combine principles and advantages of both snoopy and directory protocols on different levels seem to be highly effective for these systems.

Figure 5: Hierarchical organization of a large-scale shared memory multiprocessor. Legend: M - global memory module, ICN - global interconnection network, CC - cluster controller, CB - cluster bus, LM - local memory, C - cache memory, P - processor.

Appropriate snoopy protocol extensions are the key issue for multiple-bus systems. Wisconsin Multicube is a highly symmetric bus-based shared memory multiprocessor which supports up to 1024 processors [Good88]. Processing nodes are placed in the intersections of busses and memory modules are tied to column busses. Each node includes dedicated two-level caches. The small first level cache reduces miss latency, while a very large second level cache is intended to reduce bus traffic. It is connected to both row and column busses and is also responsible for coherence maintenance by snooping on both busses. The invalidation protocol issues at most twice as many bus operations on multiple busses in order to acquire blocks, compared to single bus systems.

The same topology is applied in the Aquarius multiprocessor [Carl90], but it differs from the Wisconsin Multicube in the memory distribution and the cache coherence protocol. Memory is completely distributed across nodes which also contain large caches. The coherence solution combines features of snooping and directory schemes. Snooping principle maintains coherence of caches on individual busses, while directory method is used to preserve coherence among busses. Directories for memory blocks reside in large local memory modules.

A quite different multiple-bus architecture is proposed in [Wils87]. Caches and busses are hierarchically organized in a multilevel tree-like structure. Lower level caches are private and connected to local busses, while caches on higher levels are shared. An appropriate invalidation snooping solution (extended and modified Write-Once protocol) is employed for coherence maintenance. Following the inclusion property, a higher level cache can filter the invalidation actions to lower level caches and restrict coherency actions only to sections where it is necessary. Since process allocation which results in tighter sharing on lower levels and looser sharing on higher levels is naturally expected, significant traffic reduction can be attained in this way. A cluster architecture with memory modules distributed among clusters is also proposed. This approach is applied in the Encore's GigaMax project.

For the same type of organization, a somewhat different coherence solution, which accounts for the type of sharing on various levels, is proposed in [DeWa90]. Snoopy write-update (modified Dragon) protocol is used inside the subsystems, and write-invalidate (modified Berkeley) protocol is used on the global level. Another protocol proposed for hierarchical two-level bus topology with clusters can be found in [Arch88]. The protocol incorporates selected distributed features of snoopy protocols and consists of an internal and a global portion.

The hierarchical cluster architecture similar to the previous one also characterizes the Data Diffusion System [Hari89]. Very large processor caches implement virtual shared memory. Hierarchical write-invalidate protocol is used for cache coherence maintenance, which fully supports data migration. Clusters are connected to higher levels through data controllers. That is where a set-associative directory for data blocks on the lower level resides. The controller also monitors the activities on the busses in a snooping manner.

Stanford's DASH distributed shared memory multiprocessor system is composed of common-bus multiprocessor nodes linked by a scalable interconnect of general type [Leno90]. It is one of the first operational machines with scalable cache coherence, which is independent of the network type. Each node contains private caches and employs a snoopy scheme for cache coherence. System-wide coherence is maintained using a directory-based protocol. There is no global directory since it is partitioned and distributed across nodes, as well as memory. Among other means for memory access optimization, the DASH supports a more relaxed release consistency model in hardware.

The MIT's Alewife multiprocessor uses the LimitLESS protocol, which represents a hardware-based cache coherence method supported with a software mechanism [Chai91]. Directory entries implement only a limited number of hardware pointers, in order to be storage efficient, counting that it is sufficient in a vast majority of cases. Exceptional circumstances, when more pointers are needed, are handled in software. In those infrequent cases, an interrupt is generated, and a full-map directory for the block is emulated. A fast trap mechanism provides support for this feature.
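The LimitLESS idea described in this section (a few hardware pointers per directory entry, with software taking over when they are exhausted) can be sketched as follows. This is a simplified model, not the Alewife implementation: the class name `LimitedDirectoryEntry`, the default of four hardware pointers, and the `traps` counter standing in for the interrupt are all assumptions made for illustration.

```python
class LimitedDirectoryEntry:
    """Directory entry with a few hardware pointers and a software overflow set."""

    def __init__(self, hw_pointers=4):
        self.hw_pointers = hw_pointers   # assumed number of hardware pointers
        self.hw_sharers = []             # node ids tracked in hardware
        self.sw_sharers = set()          # overflow sharers, emulated in software
        self.traps = 0                   # how often the software handler ran

    def add_sharer(self, node):
        if node in self.hw_sharers or node in self.sw_sharers:
            return
        if len(self.hw_sharers) < self.hw_pointers:
            self.hw_sharers.append(node)     # common case: handled in hardware
        else:
            # Exceptional case: an interrupt lets software record the sharer,
            # emulating a full-map directory for this block.
            self.traps += 1
            self.sw_sharers.add(node)

    def invalidate_all(self):
        """Return every node that must receive an invalidation, then clear."""
        targets = set(self.hw_sharers) | self.sw_sharers
        self.hw_sharers.clear()
        self.sw_sharers.clear()
        return targets
```

As long as a block has no more sharers than hardware pointers, the software path is never taken, which is the case the scheme counts on being the vast majority.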

7. Conclusion

This survey tries to give a comprehensive overview of hardware-based solutions to the cache coherence problem in shared memory multiprocessors. Despite the considerable advancement of the field, it still represents a very active research area. Much work in developing, implementing, and evaluating the solutions remains to be done, with significant prospective benefits. The great importance of the problem and the strong impact of the applied solution on system performance put an obligation on system architects and designers to carefully consider the aforementioned issues in their pursuit of more powerful and more efficient shared memory multiprocessors. Doubtlessly, this topic deserves much more time and space, so we decided to provide a list of papers for additional reading, on a number of different subjects related to the general topic of cache coherence in multiprocessor systems.

So far, hardware and software solutions are most of the time developed independently of each other, each approach trying to solve the problem only in its own domain. However, we strongly believe that a very promising approach in attaining the optimal performance is the complementary use of hardware and software means. We expect that this is a direction where real breakthroughs are likely to happen.

8. References

[Agar89] Agarwal A., Simoni R., Hennessy J., Horowitz M., "An Evaluation of Directory Schemes for Cache Coherence," Proceedings of the 15th ISCA, May 1988, pp. 280-289.
[Arch85] Archibald J., Baer J. L., "An Economical Solution to the Cache Coherence Problem," Proceedings of the 12th ISCA, 1985, pp. 355-362.
[Arch86] Archibald J., Baer J. L., "Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model," ACM Transactions on Computer Systems, Vol. 4, No. 4, November 1986, pp. 273-298.
[Arch87] Archibald J., "The Cache Coherence Problem in Shared-Memory Multiprocessors," PhD Thesis, University of Washington, February 1987.
[Arch88] Archibald J., "A Cache Coherence Approach For Large Multiprocessor Systems," Proceedings of the Supercomputing Conference, 1988, pp. 337-345.
[Baer88] Baer J. L., Wang W. H., "On the Inclusion Properties for Multi-level Cache Hierarchies," Proceedings of the 15th ISCA, 1988.
[Bita86] Bitar P., Despain A., "Multiprocessor Cache Synchronization: Issues, Innovations, Evolution," Proceedings of the 13th ISCA, June 1986, pp. 424-442.
[Carl90] Carlton M., Despain A., "Aquarius Project," IEEE Computer, Vol. 23, No. 6, June 1990, pp. 80-83.
[Cens78] Censier L., Feautrier P., "A New Solution to Coherence Problems in Multicache Systems," IEEE Transactions on Computers, Vol. C-27, No. 12, December 1978, pp. 1112-1118.
[Chai90] Chaiken D., Fields C., Kurihara K., Agarwal A., "Directory-Based Cache Coherence in Large-Scale Multiprocessors," IEEE Computer, Vol. 23, No. 6, June 1990, pp. 49-58.
[Chai91] Chaiken D., Kubiatowicz J., Agarwal A., "LimitLESS Directories: A Scalable Cache Coherence Scheme," Proceedings of the 4th ASPLOS, April 1991, pp. 224-234.
[Dubo88] Dubois M., Scheurich C., Briggs F., "Synchronization, Coherence, and Event Ordering in Multiprocessors," IEEE Computer, Vol. 21, No. 2, February 1988, pp. 9-21.
[Egge89] Eggers S., "Simulation Analysis of Data Sharing in Shared Memory Multiprocessors," PhD Thesis, Report No. UCB/CSD 89/501, University of California, Berkeley, April 1989.
[Egge89a] Eggers S., Katz R., "Evaluating Performance of Four Snooping Cache Coherency Protocols," Proceedings of the 16th ISCA, 1989, pp. 2-14.
[Fran84] Frank S., "Tightly Coupled Multiprocessor System Speeds Memory-access Times," Electronics, 57, 1, January 1984, pp. 164-169.
[Good83] Goodman J., "Using Cache Memory To Reduce Processor-Memory Traffic," Proceedings of the 10th ISCA, 1983, pp. 124-131.
[Good88] Goodman J., Woest P., "The Wisconsin Multicube: A New Large-scale Cache-coherent Multiprocessor," Proceedings of the 15th ISCA, May 1988, pp. 422-431.
[Hari89] Haridi S., Hagersten E., "The Cache Coherence Protocol of the Data Diffusion Machine," Proceedings of the PARLE 89, Vol. 1, 1989.
[Jame90] James D., Laundrie A., Gjessing S., Sohi G., "Scalable Coherent Interface," IEEE Computer, Vol. 23, No. 6, June 1990, pp. 74-77.
[Karl86] Karlin A., Manasse M., Rudolph L., Sleator D., "Competitive Snoopy Caching," Proceedings of the 27th Annual Symposium on Foundations of Computer Science, 1986, pp. 244-254.
[Katz85] Katz R., Eggers S., Wood D., Perkins C., Sheldon R., "Implementing a Cache Consistency Protocol," Proceedings of the 12th ISCA, 1985, pp. 276-283.
[Lee90] Lee J., Ramachandran U., "Synchronization with Multiprocessor Caches," Proceedings of the 17th ISCA, 1990, pp. 27-37.
[Leno90] Lenoski D., Laudon J., Gharachorloo K., Gupta A., Hennessy J., "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor," Proceedings of the 17th ISCA, 1990, pp. 148-159.
[McCr84] McCreight E., "The Dragon Computer System: An Early Overview," Proceedings of the NATO Advanced Study Institute on Microarchitecture of VLSI Computers, Urbino, Italy, July 1984.
[Mend89] Mendelson A., Pradhan D., Singh A., "A Single Copy Data Coherence Scheme for Multiprocessor Systems," Computer Architecture News, 1989, pp. 36-49.
[OKra90] O'Krafka B., Newton A., "An Empirical Evaluation of Two Memory-Efficient Directory Methods," Proceedings of the 17th ISCA, 1990, pp. 138-147.
[Papa84] Papamarcos M., Patel J., "A Low-overhead Coherence Solution for Multiprocessors with Private Cache Memories," Proceedings of the 11th ISCA, 1984, pp. 348-354.
[Pret90] Prete C., "A New Solution of Coherence Protocol for Tightly Coupled Multiprocessor Systems," Microprocessing and Microprogramming, 30, 1990, pp. 207-214.
[Rudo84] Rudolph L., Segall Z., "Dynamic Decentralized Cache Schemes for MIMD Parallel Processors," Proceedings of the 11th ISCA, 1984.
[Sten90] Stenstrom P., "A Survey of Cache Coherence Schemes for Multiprocessors," IEEE Computer, Vol. 23, No. 6, June 1990, pp. 12-24.
[Swea86] Sweazey P., Smith A., "A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus," Proceedings of the 13th ISCA, June 1986, pp. 414-423.
[Tang76] Tang C., "Cache System Design in the Tightly Coupled Multiprocessor System," Proceedings of the National Computer Conference, 1976, pp. 749-753.
[Thac88] Thacker C., Stewart L., Satterthwaite E., "Firefly: A Multiprocessor Workstation," IEEE Transactions on Computers, Vol. 37, No. 8, August 1988.

[Wang89] Wang W., Baer J., Levy R., "Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy," Proceedings of the 16th ISCA, June 1989, pp. 140-148.
[Wils87] Wilson A., "Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors," Proceedings of the 14th ISCA, June 1987, pp. 244-252.

9. Suggestions for further reading

Here is provided an additional list of references (not mentioned in the paper because of space restrictions). The papers cover a broad range of topics in the general area of cache coherence: parallel program characteristics, analytic and simulation models of protocols, performance evaluation and comparison, etc. It gives a deeper insight into the cache coherence problem in shared memory multiprocessors.

[Adve91] Adve S., Adve V., Hill M., Vernon M., "Comparison of Hardware and Software Cache Coherence Schemes," Proceedings of the 18th ISCA, 1991, pp. 298-308.
[Baer89] Baer J., Wang W., "Multilevel Cache Hierarchies: Organizations, Protocols and Performance," Journal of Parallel and Distributed Computing, Vol. 6, No. 3, 1989, pp. 451-476.
[Cheo89] Cheong H., Veidenbaum A., "A Version Control Approach to Cache Coherence," Proceedings of the International Conference on Supercomputing 89, June 1989, pp. 322-330.
[Cher91] Cheriton D., Goosen H., Boyle P., "Paradigm: A Highly Scalable Shared-Memory Multicomputer Architecture," IEEE Computer, Vol. 24, No. 2, February 1991, pp. 33-46.
[Cytr88] Cytron R., Karlovsky S., McAuliffe K., "Automatic Management of Programmable Caches," Proceedings of the 1988 ICPP, August 1988, pp. 229-238.
[Dubo82] Dubois M., Briggs F., "Effects of Cache Coherency in Multiprocessors," IEEE Transactions on Computers, Vol. C-31, No. 11, November 1982, pp. 1083-1099.
[Egge88] Eggers S., Katz R., "A Characterization of Sharing in Parallel Programs and its Application to Coherency Protocol Evaluation," Proceedings of the 15th ISCA, May 1988, pp. 373-383.
[Egge89] Eggers S., Katz R., "The Effect of Sharing on the Cache and Bus Performance of Parallel Programs," Proceedings of the 3rd ASPLOS, April 1989, pp. 257-270.
[Good87] Goodman J., "Coherency for Multiprocessor Virtual Address Caches," Proceedings of the 2nd ASPLOS, October 1987, pp. 72-81.
[Min89] Min S., Baer J., "A Timestamp-Based Cache Coherence Scheme," Proceedings of the 1989 ICPP, 1989.
[Owic89] Owicki S., Agarwal A., "Evaluating the Performance of Software Cache Coherence," Proceedings of the 3rd ASPLOS, April 1989, pp. 230-242.
[Sche87] Scheurich C., Dubois M., "Correct Memory Operation of Cache-Based Multiprocessors," Proceedings of the 14th ISCA, June 1987.
[Site88] Sites R., Agarwal A., "Multiprocessor Cache Analysis Using ATUM," Proceedings of the 15th ISCA, May 1988, pp. 186-195.
[Smit82] Smith A., "Cache Memories," Computing Surveys, Vol. 14, No. 3, September 1982, pp. 473-530.
[Smit85] Smith A., "CPU Cache Consistency with Software Support and Using One-Time Identifiers," Proceedings of the Pacific Computer Communications Symposium, Seoul, 1985, pp. 153-161.
[Sten89] Stenstrom P., "A Cache Coherence Protocol for Multiprocessors with Multistage Networks," Proceedings of the 16th ISCA, June 1989, pp. 407-415.
[Tart92] Tartalja I., Milutinovic V., "An Approach to Software Cache Consistency Maintenance Based on Conditional Invalidation," Proceedings of the 25th HICSS, January 1992, pp. 457-466.
[Thak90] Thakkar S., Dubois M., Laundrie A., Sohi G., "Scalable Shared-Memory Multiprocessor Architectures," IEEE Computer, Vol. 23, No. 6, June 1990.
[Thap90] Thapar M., Delagi B., "Cache Coherence for Shared Memory Multiprocessors," Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, July 1990, pp. 155-160.
[Toma92] Tomasevic M., Milutinovic V., "A Simulation Study of Snoopy Cache Coherence Protocols," Proceedings of the 25th HICSS, January 1992, pp. 427-436.
[Vern86] Vernon M., Holliday M., "Performance Analysis of Multiprocessor Cache Consistency Protocols Using Generalized Timed Petri Nets," Proceedings of the Performance'86 and ACM Sigmetrics 1986, May 1986, pp. 9-17.
[Vern88] Vernon M., Lazowska E., Zahorjan J., "An Accurate and Efficient Performance Analysis Technique for Multiprocessor Snooping Cache-Consistency Protocols," Proceedings of the 15th ISCA, May 1988, pp. 308-315.
[Webe89] Weber W. D., Gupta A., "Analysis of Cache Invalidation Patterns in Multiprocessors," Proceedings of the 3rd ASPLOS, April 1989, pp. 243-256.
[Wood89] Woodbury P., Wilson A., Shein B., Gertner I., Chen P., Bartlett J., Aral Z., "Shared Memory Multiprocessors: The Right Approach to Parallel Processing," Proceedings of the Spring Compcon'89, 1989, pp. 72-80.
[Yeh83] Yeh C., Patel J., Davidson E., "Shared Cache for Multiple-stream Computer Systems," IEEE Transactions on Computers, C-32(1), January 1983, pp. 38-47.
[Yen85] Yen W., Yen D., Fu K., "Data Coherence Problem in a Multicache System," IEEE Transactions on Computers, Vol. C-34, No. 1, January 1985, pp. 56-65.

A Survey of Hardware Solutions for Maintenance of Cache Coherence in Shared Memory Multiprocessors, M. Tomasevic and V. Milutinovic. Please see page 863 for this paper.