A SURVEY OF HARDWARE SOLUTIONS FOR MAINTENANCE OF CACHE COHERENCE IN SHARED MEMORY MULTIPROCESSORS

Milo Tomašević, Department of Computer Engineering, Pupin Institute, POB 15, 11000 Belgrade, Yugoslavia. E-mail: etomasev@yubgefi1.bitnet
Abstract
An appropriate solution to the well-known cache coherence problem in shared memory multiprocessors is one of the key issues in improving the performance and scalability of these systems. Hardware methods are highly convenient because of their transparency for software. They also offer good performance, since they deal with the problem fully dynamically. A great variety of schemes has been proposed; not many of them were implemented. Two major groups of hardware protocols can be recognized: directory and snoopy. This survey underlines their principles and summarizes a relatively large number of relevant representatives from both groups. The coherence problem in multilevel caches is also briefly considered. Special attention is devoted to cache coherence maintenance in large scalable shared memory multiprocessors.

Veljko Milutinović, School of Electrical Engineering, University of Belgrade, POB 816, 11000 Belgrade, Yugoslavia. E-mail: milutino@pegasus.ch

1. Introduction
Multiprocessors are the most appropriate type of computer system to meet the ever-increasing need for more computing power. Among them, shared memory multiprocessors represent an especially popular and efficient class. Their increasing use is due to some significant advantages that they offer. The most important advantage is the simplest and the most general programming model, which allows easier development of parallel software and supports efficient sharing of code and data. However, nothing comes for free. Shared memory multiprocessors suffer from potential problems in achieving high performance, because of the inherent contention in accessing shared resources, which is responsible for longer latencies in accessing shared memory. The introduction of private caches attached to the processors helps greatly in reducing average latencies (Figure 1). Caches in multiprocessors are even more useful than in uniprocessors, since they also increase the effective memory and communication bandwidth.

However, this solution imposes another serious problem. Multiple copies of the same data block can exist in different caches, and if processors are allowed to update their own copies freely, an inconsistent view of the memory is imminent, leading to program malfunction. This is the essence of the well-known cache coherence problem [Dubo88]. The reasons for coherence violation are: sharing of writable data, process migration, and I/O activities. A system of caches is said to be coherent if every read by any processor finds a value produced by the last previous write, no matter which processor performed it. Because of that, the system must incorporate a cache coherence maintenance mechanism which consists of a complete and consistent set of operations for accessing shared memory, preserves a coherent view of the memory, and ensures program execution with correct data.

The cache coherence problem has attracted considerable attention through the last decade. A lot of research at prominent universities and companies has been devoted to the problem, resulting in a number of proposed solutions. The importance of the problem is emphasized by the fact that a cache coherence solution is not only necessary for correct program execution; it can also have a very significant impact on system performance, so it is of utmost importance to employ cache coherence solutions as efficiently as possible. It is firmly established that the efficiency of a cache coherence solution depends to a great extent on system architecture parameters and especially on parallel program characteristics [Egge89]. The choice of an appropriate solution is even more critical in large multiprocessors, where inefficiencies in coherence maintenance are multiplied, which can seriously hurt scalability.
This research was partially sponsored by the NCR Corporation, Augsburg, Germany, and the NSF of Serbia, Belgrade, Yugoslavia. The ENDOT simulation tools were donated by the ZYCAD Corporation, Menlo Park, California, USA, and the TDT Corporation, Cleveland Heights, Ohio, USA. The MOSIS-compatible VLSI design tools were provided by the Tanner Corporation, Pasadena, California, USA. The design tools were also provided by the ALTERA Corporation, Santa Clara, California, USA.

Proceedings of the Hawaii International Conference on System Sciences, Hawaii, USA, January 1993.

Figure 1: A shared memory multiprocessor system with private caches (processors with private caches connected to the memory modules through an interconnection network, ICN).



2. Cache coherence solutions

Basically, all solutions to the cache coherence problem can be classified into two large groups: software-based and hardware-based [Sten90]. This is a traditional classification that still holds, since the proposed solutions (so far) follow fully or predominantly one of the two approaches. Meanwhile, solutions using a combination of hardware and software means are becoming more frequent and promising, so it seems that this classification will be of less use in the future.

2.1. Software solutions

Software-based solutions generally rely on the actions of the programmer, compiler, or operating system in dealing with the coherence problem. The simplest but most restrictive method is to declare pages of shared data noncacheable. More advanced methods allow the caching of shared data and access to them only in critical sections, in a mutually exclusive way. Special cache managing instructions are often used for cache bypass, flush, and indiscriminate or selective invalidation. Decisions about coherence-related actions are often made statically during compiler analysis, which tries to detect conditions for coherence violation; inevitable inefficiencies are incurred there, since the compiler analysis is unable to predict the flow of program execution accurately and conservative assumptions have to be made, especially in static schemes and for higher levels of data sharing. There are also some dynamic methods based on operating system actions. Software schemes are generally less expensive than their hardware counterparts, although they may require considerable hardware support, and it is also claimed that they are more convenient for large, scalable multiprocessors. Software-based solutions will not be elaborated upon any further in this paper.

2.2. Hardware solutions

We have focused our attention on the hardware-based solutions, usually called cache coherence protocols. Hardware schemes deal with the coherence problem by dynamic recognition of inconsistency conditions for shared data, entirely at run time. Being totally transparent to software, hardware protocols free the programmer and compiler from any responsibility for coherence maintenance, and impose no restrictions on any layer of software. They promise better performance, since the coherence overhead is generated only when actual sharing of data takes place. Although they require an increased hardware complexity, technology advances have made their cost quite acceptable compared to the system costs, and this cost is well justified by the significant advantages of the hardware-based approach. Due to the aforementioned reasons, hardware cache coherence protocols are much more investigated in the literature, and also much more frequently implemented in commercial multiprocessor systems. Various proposed hardware schemes efficiently support the full range from small- to large-scale multiprocessors.

The essential characteristics of existing hardware schemes are reflected in the following criteria: where the status information about data blocks is held and how it is organized; who is responsible for preserving coherence in the system; which write strategy is accepted; which type of notification is used; and what kind of coherence actions are applied. According to the most widely accepted classification, based on the first and the second criterion, hardware cache coherence schemes can be principally divided into two large groups: directory and snoopy protocols. In the next two sections, the basic principles of both groups are presented and numerous solutions belonging to them are surveyed. After that, cache coherence in multilevel hierarchies is discussed. Finally, the use of the previous methods in large scalable multiprocessors with cache-coherent architectures is considered.

This survey follows one specific organizational structure. Each approach to be considered here is briefly introduced and, whenever appropriate, the following points of view are underlined: research environment, essence of the approach, selected details of the approach, advantages, disadvantages, and special comments (performance, implementation, etc.). The survey tries to be as broad as possible; because of space restrictions, however, it is not exhaustive.

3. Directory protocols

The main characteristic that distinguishes this group of schemes is that the global, system-wide status information relevant for coherence maintenance is stored in some kind of directory. The responsibility for preserving coherence is predominantly delegated to a centralized controller, which is usually a part of the main memory controller. Besides the global directory maintained by the central controller, private caches store some local state information about cached blocks. Every local action which can affect the global state of a block must be reported to the central controller, which is also responsible for keeping the status information up to date. Upon the individual requests of the local cache controllers, the centralized controller checks the directory and issues the necessary commands for data transfer between memory and caches, or between the caches themselves, in order to maintain coherence. Directory protocols are primarily suitable for multiprocessors with general interconnection networks.

The global directory can be organized in several ways. Paper [Agar89] introduces one useful classification of directory schemes, denoting them as Dir_i X, where i is the number of pointers and X is B or NB, for broadcast and no-broadcast schemes, respectively. According to the organization of the directory, directory methods can be divided into three groups: full-map directory, limited directory, and chained directory schemes.

3.1. Full-map directory schemes

The main characteristic of these schemes is that the directory is stored in the main memory and contains entries for each memory block. An entry points to the exact locations of every cached copy of a memory block, and keeps its status.

The first protocol from this class was developed in [Tang76] and later implemented in the IBM 3081. Local cache directories store two bits per cache block (the valid and modified bits). The memory directory is organized as a set of copies of all individual cache directories, and the information in the memory directory and in the cache directories has to be kept consistent all the time. Directory search is not easy, because the duplicates of all cache directories must be checked when determining the status of a particular memory block.

The classical full-map directory scheme proposed in [Cens78] applies the same coherence policy, but provides a much more efficient directory search. The directory contains an entry for each memory block in the form of a bit vector (a Dir_N NB scheme). A directory entry consists of N + 1 bits: one presence bit for each of the N processor-cache pairs, and one bit which denotes whether the block is modified in one of the caches (Figure 2a). The protocol uses an invalidation approach and allows the existence of multiple unmodified cached copies of the same block in the system, but only one modified copy. A very similar directory scheme is described in [Yen85], but cache entries have an additional bit per block, which is set in the case of a clean exclusive copy. This eliminates the need for directory access on write hits to unmodified private blocks, but implies the burden of maintaining the correct value of this bit.

The main advantage of the full-map approach is that locating the necessary cached copies is easy. Using this information, coherence of data in the private caches is maintained by directed messages to known locations, avoiding usually expensive broadcasts. Only caches with valid copies are involved in the coherence actions for a particular block; because of that, full-map schemes deliver the best performance of all directory schemes. Such an organization, however, implies some serious disadvantages and is not scalable, for several reasons. The most serious problem is the significant memory overhead for full-map directory storage in multiprocessor systems with a large number of processors. Since all requests are directed to the central directory, it can become a performance bottleneck. The centralized controller is also inflexible for system expansion by adding new processors. One approach to alleviate the storage problem is presented in [OKra90]: the proposed sectored scheme reduces directory size by increasing the block size, while choosing the subblock as the coherency unit.

One example of an extremely storage-efficient method is the "two-bit scheme" [Arch85], where only two bits are used to encode the state of each block. Since the exact identities of the caches containing copies of the block are not known, the scheme has to broadcast the coherence command when necessary. The protocol is easily expandable without further modifications.

3.2. Limited directory schemes

The motivation to cope with the problem of memory overhead in full-map directory schemes led to centralized schemes with partial maps, or limited directories. These protocols put an upper limit on the number of simultaneously cached copies of a particular block. They replace the presence bit vector with a small number of identifiers pointing to the cached copies (Figure 2b). This concept is justified by the findings of studies such as [Webe89] that the number of simultaneously existing cached copies of the same block is usually small. An implementation of limited directory schemes is described in [Agar89]. Entries in limited directories contain a fixed number of pointers, so special actions have to be taken when the number of cached copies exceeds the number of pointers. The schemes with broadcast capability allow that situation, because they can invalidate all copies using a broadcast signal when necessary (Dir_i B schemes). If the protocol disallows broadcasts, one copy has to be invalidated in order to free a pointer for the new cached copy (Dir_i NB schemes). The condition for storage efficiency of limited directory schemes over full-map directory schemes is given by i*log2(N) < N, where i is the number of pointers and N is the number of processors; for small i and large N, the size difference between a full and a limited directory is significant. The scalability of limited directory protocols, in terms of memory overhead for directory storage, is thus quite good; however, the performance degradation for higher levels of sharing and a greater number of processors can be intolerable.

A solution which combines the partial- and full-map approaches is given in [OKra90]. Tags of two different sizes are used, stored in two associative caches and allocated when needed. On the first reference, a small tag is allocated to the block; the small tag contains a limited number of pointers. When the small tag becomes insufficient, a large tag is allocated, which consists of the full-map bit vector. When no tags are free, one of the used tags is selected, it is freed, and the tag is reallocated. The coherence protocol can also be bypassed for private data. With respect to tag storage efficiency, the approach is quite good.

Figure 2: Examples of the three directory organizations: a) full-map directory organization, b) limited directory organization, c) chained directory organization. (Legend: valid bit, dirty bit, data block, pointer, chain terminator; M - shared memory, C - cache, N - number of processors.)
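To make the storage trade-off between the first two directory organizations concrete, the following sketch contrasts a Censier-Feautrier style full-map entry with a Dir_i style limited entry and checks the i*log2(N) < N condition. It is an illustrative example only, not taken from any of the cited papers; the constants and type names are assumptions.

```c
#include <stdio.h>
#include <math.h>

#define N    256   /* number of processor-cache pairs (assumed) */
#define NPTR 4     /* pointers per limited-directory entry, i (assumed) */

/* Full-map entry: one presence bit per cache plus a dirty bit (Figure 2a). */
struct fullmap_entry {
    unsigned char present[(N + 7) / 8];  /* N presence bits               */
    unsigned char dirty;                 /* block modified in some cache  */
};

/* Limited entry: i pointers of log2(N) bits each plus a dirty bit (Figure 2b). */
struct limited_entry {
    unsigned short ptr[NPTR];            /* cache identifiers             */
    unsigned char  count;                /* pointers currently in use     */
    unsigned char  dirty;
};

/* Storage (in bits) logically needed per directory entry, ignoring padding. */
static unsigned fullmap_bits(void) { return N + 1; }
static unsigned limited_bits(void)
{
    unsigned id_bits = (unsigned)ceil(log2(N));
    return NPTR * id_bits + 1;
}

int main(void)
{
    printf("full map : %u bits per block entry\n", fullmap_bits());
    printf("limited  : %u bits per block entry\n", limited_bits());
    /* Limited directories win on storage exactly when i*log2(N) < N. */
    printf("i*log2(N) < N ? %s\n",
           NPTR * (unsigned)ceil(log2(N)) < N ? "yes" : "no");
    return 0;
}
```

For the assumed values (N = 256, i = 4) the limited entry needs 33 bits against 257 bits for the full map, which is the kind of gap the survey refers to for small i and large N.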

3.3. Chained directory schemes

Another way to ensure the scalability of directory schemes is the introduction of chained, distributed directories. Unlike the two previous directory approaches, chained directories are spread across the individual caches. Entries of such a directory are organized in the form of linked lists, where all caches sharing the same block are chained through pointers into one list (Figure 2c). The entry in main memory is used only to point to the head of the list and to keep the block status, and the last member of the list contains the chain terminator. Requests for the block are issued to the memory, and subsequent commands from the memory controller are usually forwarded through the list. It is important that the approach does not limit the number of cached copies. Chained directories can be organized in the form of either singly or doubly linked lists.

Distributed directories implemented as singly linked lists are used in the Stanford Distributed-Directory Protocol [Thap90]. On a read miss, the requester is put on the head of the list and obtains the data from the previous head. On a write request, the invalidation signal is forwarded from the head through the intermediate members to the tail. Completion of actions is acknowledged with reply messages, which prevents race conditions. Replacement of a chained copy in a cache is handled by invalidation of the lower part of the list. Some optimizations are incorporated into the coherence mechanism; one of the most important is the possibility of combining the requests for list insertions.

A distributed directory with doubly linked lists is proposed in the SCI (Scalable Coherent Interface) project [Jame90]. Each list entry has a forward and a backward pointer. An entry is added to the head of the list by updating the pointers of both the new entry and the old head entry. Because of better handling of the replacement situation, doubly linked lists perform slightly better than singly linked lists, at the expense of being more complex and using twice as much storage for pointers: the replacement problem is alleviated because an entry can easily be dropped by chaining its predecessor and its successor, using the pointers.

The main advantage of chained directory schemes is their scalability, in terms of the memory overhead for directory storage, while performance is almost as good as in full-map directory schemes [Chai90].

4. Snoopy protocols

In this group of hardware solutions, the centralized controller and the global state information are not employed. Coherence maintenance is based only on the actions of the local cache controllers and distributed local state information. In order to preserve coherence, all the actions for a currently shared block must be announced to all other caches, via a broadcast capability. Local cache controllers are able to snoop on the network, to recognize the actions and conditions for coherence violation, and to react according to the utilized protocol. Snoopy protocols are ideally suited for multiprocessors which use a shared bus as the global interconnect, since the shared bus provides very inexpensive and fast broadcasts. They are also known as very cost-effective and flexible schemes. However, coherence actions on the shared bus additionally increase the bus traffic and make bus saturation more acute. Consequently, only systems with a small to medium number of processors can be supported by snoopy protocols, and their performance heavily depends on the sharing characteristics of the parallel applications. Two write policies are usually applied in snoopy protocols: write-invalidate and write-update (or write-broadcast).

4.1. Write-invalidate snoopy protocols

Write-invalidate protocols allow multiple readers, but only one writer at a time. Every write to a shared block must be preceded by the invalidation of all other copies of the same block, preventing the use of stale copies (Figure 3a). This mechanism is cost-effective for applications with prevalent sharing in the form of migratory data and synchronization variables. The approach originated from the WTI protocol, where a write-through to memory on the system bus is followed by the invalidation of all copies of the involved block. This simple mechanism can be found in some earlier machines; such an unsophisticated coherence solution results in poor system performance because of a very high bus utilization [Arch86], so only systems with a very small number of processors can use this scheme. Another very simple scheme, the SCCDC protocol, prohibits caching of more than one copy of a block [Mend89]: every read and write miss takes over the copy and makes it private.

The Write-Once protocol was one of the first write-back snoopy schemes. It was intended for single-board computers in the Multibus environment [Good83]. The action on the first write to a shared block is the same as in the WTI protocol, leaving the block in the special "reserved" state. Subsequent write hits to that copy (in the "dirty" state) proceed locally, until some other processor requests the block. Another improvement is the ability to service read misses from the cache which holds the "dirty" block.

An ownership-based protocol was developed for the Synapse N+1, a fault-tolerant multiprocessor system [Fran84]. A bit for each memory block determines whether memory or a cache is the owner of the block. Cache ownership ensures cheap local accesses: once the block is made exclusive, cheap local writes can proceed until some other processor requires the same block. A read miss to a block owned by another cache is handled inefficiently, however, because memory has to be updated first and the read request is then resubmitted. Another disadvantage is that a write hit to shared data produces the same action as a write miss.
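The following sketch illustrates the write-invalidate principle just described with a simplified three-state snooping controller in the spirit of the WTI and Write-Once schemes. It is not the state machine of any one published protocol; all names are illustrative assumptions.

```c
#include <stdio.h>

/* Simplified per-block cache states for a write-invalidate snoopy scheme. */
typedef enum { INVALID, SHARED, DIRTY } state_t;

typedef enum { BUS_READ, BUS_WRITE_INV } bus_op_t;

/* Local processor write: announce it so all other copies are invalidated
 * first (Figure 3a); afterwards writes proceed locally in the DIRTY state. */
static state_t local_write(state_t s, void (*broadcast)(bus_op_t))
{
    if (s != DIRTY)
        broadcast(BUS_WRITE_INV);   /* invalidate the peers' copies */
    return DIRTY;
}

/* Snooped bus operation issued by some other cache for this block. */
static state_t snoop(state_t s, bus_op_t op)
{
    if (op == BUS_WRITE_INV)
        return INVALID;             /* another writer exists: drop the copy */
    if (op == BUS_READ && s == DIRTY)
        return SHARED;              /* supply data / write back, then demote */
    return s;
}

static void fake_bus(bus_op_t op) { printf("bus op %d broadcast\n", (int)op); }

int main(void)
{
    state_t a = SHARED, b = SHARED;       /* two caches share the block   */
    a = local_write(a, fake_bus);         /* cache A writes               */
    b = snoop(b, BUS_WRITE_INV);          /* cache B snoops and drops it  */
    printf("A=%d (DIRTY), B=%d (INVALID)\n", (int)a, (int)b);
    return 0;
}
```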

Figure 3: Write strategy in snoopy protocols: a) invalidation policy, b) update policy. (Legend: V - valid bit, M - shared memory, C - cache memory; the figure shows the invalidation signal and the distributed write of the updated data block.)

The Berkeley protocol, implemented in the SPUR multiprocessor, also applies the ownership principle, but improves significantly over the previously described protocol [Katz85]. Efficient cache-to-cache transfers without a memory update are enabled by the existence of the new "shared dirty" state. The main weakness of the protocol is its inability to recognize possible exclusive use of a block, so a write hit to an unmodified copy must be announced on the bus even when no other cached copy exists.

The problem of recognizing the sharing status on a block load is solved in the Illinois protocol [Papa84]. To this end, the new "exclusive unmodified" state is introduced. When some cache issues a bus read, every cache with a valid copy tries to respond and to supply the data, but only one cache actually responds; this requires correct arbitration on the shared bus, which can slow down the action. If the block is modified, a memory update takes place simultaneously with the cache-to-cache transfer. Private data are better handled, since the protocol entirely avoids invalidations on write hits to unmodified non-shared blocks. It makes the handling of read and write misses to most unmodified blocks more efficient compared to protocols in which only memory can be the clean owner.

The CMU RB protocol presented in [Rudo84] tries to remove the main obstacle to performance improvement in write-invalidate protocols, invalidation misses, by introducing block validation. Blocks of one-word length are assumed, so spatial locality in programs cannot be exploited in a proper way, and only three states are used to specify the block status. When some cache issues a bus read, all caches with invalidated copies of that block catch the data available on the bus and update their own copies. This read broadcast cuts down the number of invalidation misses to one per invalidated block. A more sophisticated version is also proposed, which uses software hints to distinguish between loads of shared and non-shared data. In spite of the general usefulness of the read-broadcast action, there is a negative side effect: an increase of cache interference, reflected in high processor lockout from the bus [Egge89a].

Many of the previously mentioned useful features are included in the EIP (Efficient Invalidation Protocol) [Arch87], in the effort to achieve the optimal performance. The EIP also applies the concept of block validation, as in the RB protocol. Besides dirty owners, this protocol proposes another new feature, clean cache ownership: when no dirty block owner exists, a clean block owner is defined, as the last cache that experienced a read miss for the block, and it retains that role unless the block is replaced later. Implementation of the EIP requires an increased hardware complexity.
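The ownership idea shared by the Synapse, Berkeley and EIP protocols can be summarized as: on a read miss the current owner, not necessarily memory, supplies the block. The sketch below is a minimal illustration under assumed state names; it does not reproduce the exact state set or bus sequences of any of those protocols.

```c
#include <stdio.h>

#define NCACHE 4

/* Assumed state names; the real protocols differ in detail. */
typedef enum { INV, VALID, SHARED_DIRTY, DIRTY } bstate_t;

struct cache { bstate_t st; int data; };

/* Service a read miss for cache `me`: a dirty owner (if any) responds
 * with a cache-to-cache transfer, otherwise memory supplies the block. */
static int read_miss(struct cache c[], int me, int memory_copy)
{
    for (int i = 0; i < NCACHE; i++) {
        if (i != me && (c[i].st == DIRTY || c[i].st == SHARED_DIRTY)) {
            c[i].st = SHARED_DIRTY;        /* owner keeps ownership        */
            c[me].st = VALID;
            return c[me].data = c[i].data; /* no memory access needed      */
        }
    }
    c[me].st = VALID;                      /* clean copy comes from memory */
    return c[me].data = memory_copy;
}

int main(void)
{
    struct cache c[NCACHE] = { {DIRTY, 42}, {INV, 0}, {INV, 0}, {INV, 0} };
    printf("cache 1 read miss -> %d (served by the owner, cache 0)\n",
           read_miss(c, 1, 7 /* stale memory value */));
    return 0;
}
```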

4.2. Write-update protocols

Write-update schemes follow a distributed-write approach which allows the existence of multiple copies with write permission; the other copies need not be invalidated if they are used only for reading. The word to be written to a shared block is broadcast to all caches, and the caches containing that block update their copies (Figure 3b). In contrast to write-invalidate protocols, this group is better suited for applications with tighter sharing, especially for higher rates of shared references [Arch86]. On the other hand, performance can be seriously hurt by frequent unnecessary write broadcasts, and serious performance degradation of write-update protocols can be caused by process migration. Write-update protocols usually employ a special bus line for dynamic detection of the sharing status of a cache block; this line is activated on the occasion of a distributed write whenever more than one cached copy of the block exists. When sharing of a block ceases, the block is marked as private and the distributed writes are no longer necessary.

A typical protocol from this group is the Firefly [Thac88], implemented in DEC's Firefly multiprocessor workstation. Only three states are sufficient to describe the block status, since the invalid state is not required. Write-through is used for shared data, and write-back for private data; the distributed write to a shared block updates the other cached copies and memory. No clean owner exists, and all caches with shared copies respond to a read miss, supplying the same data in a synchronized manner.

A very similar coherence maintenance is applied in the Dragon protocol [McCr84], developed for the Dragon multiprocessor workstation from Xerox PARC. The main difference with the previous protocol is in the memory update policy: Dragon does not update memory on a distributed write, resulting in the need for an owner (clean with respect to the other cached copies) described by an additional state. The owner is the last cache that performed a write to the block, and the only situation which requires a memory update is the eviction of the owned block from the cache. Avoiding frequent memory updates on shared write hits is the source of the performance improvement over the Firefly protocol. Additional bus lines are required for determining the existence of a dirty owner and for transferring the dirty ownership.

Allowing processes to be switched among processors produces false sharing when logically private data become physically shared; in the case of sequential sharing and process migration, apparent sharing increases and unnecessary coherency overhead is induced. A Dragon-like protocol is proposed in [Pret90] which tries to cope with this problem. It introduces one bit with each block to discriminate between copies referenced by the current and by the previous process, called used and unused copies. By comparing this identifier with the running process identifier, unused copies can be detected and eliminated. In this way, some unnecessary write broadcasts can be avoided and the effects of cache interference reduced.

Since a wide diversity of coherence protocols exists in this area, differing also in their implementation details, a strong need for a standardized cache coherence protocol has appeared. The MOESI class of compatible coherence protocols is introduced in [Swea86]. The MOESI model encompasses five possible block states, defined according to the validity, exclusiveness, and ownership statuses. Most of the existing protocols map into this class, either as originally defined or with some adaptations. The class is very flexible, because each cache in the system may employ a different coherence and write policy from the class at the same time, while a consistent memory view is still guaranteed. The IEEE Futurebus standard provides all the necessary signals for the implementation of this model.

4.3. Adaptive protocols

Evidently, neither of the two approaches is able to deliver superior performance across all types of workloads. Performance studies show that write-invalidate protocols are a good choice for applications characterized by a sequential pattern of sharing, while fine-grain sharing can hurt their performance a lot [Egge89]. This is the reason why some protocols have been proposed that combine the invalidation and update policies in a suitable way, according to the observed read and write access pattern. These solutions can be regarded as adaptive, since they attempt to adapt the coherence mechanism to the observed and predicted data use.

One of the first examples of a combined policy can be found in the RWB protocol [Rudo84], which is an enhancement of the former RB protocol. It starts with write broadcasts and then switches to write-invalidates once the pollution point is reached: on the second successive write to a block by the same processor, an invalidation signal is sent onto the system bus, making the own copy exclusive. It seems that the invalidation threshold of this protocol is too low, and whether this approach brings a performance improvement or not is highly dependent on the sharing pattern of the particular parallel application.

A similar principle is followed in the EDWP protocol [Arch88]. A fixed criterion is set for switching from the update policy to the invalidate policy: after three distributed writes by a single processor, uninterrupted by a reference from any other processor, all remote cached copies are invalidated. This criterion seems to be a reliable indicator of local block usage.

The combination of the invalidate and update policies is also the essence of the competitive snooping approach [Karl86]. Coherence maintenance starts with write-updates; when their cost in cycles reaches the total cost of the invalidation misses for all processors possessing the block, the copies are invalidated. In this way, the coherence overhead stays within a factor of two of the cost of the optimal off-line algorithm. A couple of variants of the approach are proposed.

Like the EIP, the WIP (Word Invalidation Protocol) allows a cache to obtain clean ownership of a block, but instead of the usual full-block invalidation it employs partial, per-word invalidation [Toma92]. It allows the existence of invalid words within a partially valid block; after the invalidation threshold is reached, the whole block is invalidated. Block status is defined with eight possible states. A read miss to a partially valid block is serviced by a shorter bus read-word operation, which gradually recovers the block and improves the efficiency of miss service, and a write miss on the only invalid word in a partially valid block is also optimized. This "fine-grain" invalidation approach avoids some unnecessary invalidations and achieves better data utilization.
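The EDWP-style switching criterion can be pictured as a small per-block counter of consecutive distributed writes by one processor. The sketch below is only illustrative: the threshold value (three here, two in RWB), the names, and the simplification that only writes reset the run (EDWP also resets it on any remote reference) are assumptions.

```c
#include <stdio.h>
#include <stdbool.h>

#define INV_THRESHOLD 3   /* consecutive local writes before invalidating */

struct block {
    int  writer;          /* id of the last processor that wrote          */
    int  run;             /* length of its uninterrupted write run        */
    bool others_valid;    /* do remote copies still exist?                */
};

/* Called on every write hit to a shared block. Returns true when the
 * protocol decides to invalidate the remote copies instead of updating. */
static bool on_shared_write(struct block *b, int cpu)
{
    if (b->writer == cpu) b->run++;
    else { b->writer = cpu; b->run = 1; }   /* run broken by another cpu  */

    if (b->others_valid && b->run >= INV_THRESHOLD) {
        b->others_valid = false;            /* switch: broadcast invalidation */
        return true;
    }
    return false;                           /* keep broadcasting updates   */
}

int main(void)
{
    struct block blk = { .writer = -1, .run = 0, .others_valid = true };
    for (int k = 0; k < 4; k++)
        printf("write %d by cpu0 -> %s\n", k + 1,
               on_shared_write(&blk, 0) ? "invalidate" : "update");
    return 0;
}
```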

4.4. Lock-based protocols

Synchronization and mutual exclusion in accessing shared data is a topic very closely related to coherence, so some approaches find it useful to directly support synchronization primitives along with the mechanism for preserving cache coherence. Most of the snoopy coherence schemes enforce a strict consistency model and do not address these issues explicitly; the common way to achieve exclusive access during a write to shared data is to use the system bus as a semaphore. We will mention here two protocols that go further.

The protocol proposed in [Bita86] is a typical example of a lock-based protocol, which combines the coherence strategy with synchronization. It supports efficient busy-wait locking, waiting, and unlocking, without use of the test-and-set operation. Two additional states dedicated to lock handling are introduced: one for lock possession and the other for lock waiting. This mechanism allows the lock-waiter to work while its snoop does the waiting: when the unlock is broadcast on the system bus, the busy-wait register of a waiter catches it upon match, the waiter obtains the lock after prioritized bus arbitration, and the processor is interrupted to use the locked data. In this way, unsuccessful retries for lock acquisition are completely removed from the bus, which is a critical system resource. The main disadvantage of this scheme is that it does not differentiate between read lock and write lock requests.

Lock primitives are also supported in the snoopy protocol described in [Lee90]. In this scheme, waiting for a lock is organized through distributed, hardware-implemented FIFO queues. Caches which try to obtain a lock are chained into a waiting list; the tag field of each cache entry, besides the state, contains a pointer to the next waiting cache and the count of waiters in the peer group. The advantage over previous protocols is that shared and exclusive locks can be distinguished: shared locks allow simultaneous access for read lock requesters. The protocol has 13 possible block states and incurs a significant additional hardware complexity.

5. Coherence in multilevel caches

Having in mind the growing disparity between processor and memory speed, cache memories become more and more important. Meanwhile, single-level caches are unable to successfully fulfil the two usual requirements: to be fast and large enough. A multilevel cache hierarchy seems to be the unavoidable solution to this problem. The lower levels of such a hierarchy are smaller but faster caches, while the upper-level caches are slower but much larger. Their task is to reduce miss latency, in order to attain a higher hit ratio and reduce the traffic on the interconnection network.

Among the many possible organizations of multilevel caches, three appear to be the most important [Baer88]. The first organization is an extension of the single-level cache, where every processor has its private hierarchy of caches (Figure 4a). Besides private first-level caches, the other two organizations introduce shared caches on the upper levels, common to several processors, but with a different access structure: in the second organization the upper-level cache is multiported between the lower-level caches (Figure 4b), while in the third organization the upper-level cache is accessed through a shared bus from the lower-level caches (Figure 4c).

Since coherence maintenance is aggravated in multilevel caches, it is necessary to follow the principle of inclusion to make it more efficient. This principle implies that the upper-level cache is a superset of all caches in the hierarchy below it. In this way, coherence actions towards the lower levels are filtered and reduced only to the really necessary ones, lowering the cache interference, which improves performance; this gain justifies the space inefficiency incurred by inclusion. Specific extensions of existing protocols are needed to ensure coherence in multilevel caches. Applying inclusion to these types of cache hierarchies shows that the conditions for simple coherence maintenance can easily be met in the first type of organization, and that the third organization can also be an attractive solution because of its cost-effectiveness.

A specific two-level cache hierarchy is proposed in [Wang89]. Every processor has a virtually addressed first-level cache and a physically addressed second-level cache, with write-back on both levels. Virtual addressing on the first level makes it faster, because address translation is avoided in the case of a hit, although new problems like the handling of synonyms are introduced. The synonym problem is solved by maintaining pointers between the copies in the first- and second-level caches.

Figure 4: Multilevel cache organizations: a) private, b) multiport shared, c) bus-based shared. (Legend: M - memory module, C1 - first-level cache, C2 - second-level cache, SB - system bus, P - processor.)
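The way inclusion lets a shared upper-level cache filter coherence traffic can be sketched as follows: a bus invalidation is forwarded to a lower-level cache only if the upper level indicates that the block may be present below. The structures and names are illustrative assumptions, not a description of any cited design.

```c
#include <stdio.h>
#include <stdbool.h>

#define L2_SETS 8
#define NCPU    2

/* One shared L2 line: by inclusion, a block cached in any private L1
 * below must also be present here, so L2 can answer "is it below?".  */
struct l2_line {
    long tag;
    bool valid;
    bool in_l1[NCPU];        /* which private L1s hold a copy */
};

static struct l2_line l2[L2_SETS];

static void l1_invalidate(int cpu, long addr)
{
    printf("  L1 of cpu %d invalidates block %ld\n", cpu, addr);
}

/* Bus invalidation observed by the shared L2 (organizations 4b/4c). */
static void snoop_invalidate(long addr)
{
    struct l2_line *ln = &l2[addr % L2_SETS];
    if (!ln->valid || ln->tag != addr)
        return;                       /* not in L2 => not below: filtered */
    for (int c = 0; c < NCPU; c++)
        if (ln->in_l1[c])
            l1_invalidate(c, addr);   /* forward only where really needed */
    ln->valid = false;
}

int main(void)
{
    l2[5].tag = 5; l2[5].valid = true; l2[5].in_l1[1] = true;
    snoop_invalidate(5);   /* forwarded only to cpu 1's L1            */
    snoop_invalidate(6);   /* absent in L2: nothing reaches the L1s   */
    return 0;
}
```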

6. Cache coherence in large shared memory multiprocessors

One of the most required characteristics of hardware cache coherence solutions is scalability, which denotes the ability to support large-scale shared memory multiprocessor systems efficiently. In the strive for more processing power, a broad range of architectures has been proposed for such systems. Some of them try to retain the advantages of common-bus systems and overcome the scalability limitations by introducing multiple-bus architectures; appropriate snoopy protocol extensions are the key issue for these systems. Other systems are oriented towards general interconnection networks and directory-based schemes. A frequent solution is to organize the system into bus-based clusters or subsystems connected by some kind of network (Figure 5); significant traffic reduction can be attained in this way. Schemes that combine the principles and advantages of both snoopy and directory protocols on different levels seem to be highly effective for these systems.

Figure 5: Hierarchical organization of a large-scale shared memory multiprocessor. (Legend: M - global memory module, ICN - global interconnection network, CC - cluster controller, CB - cluster bus, LM - local memory, C - cache memory, P - processor.)

The Wisconsin Multicube is a highly symmetric bus-based shared memory multiprocessor which supports up to 1024 processors [Good88]. Processing nodes are placed at the intersections of busses and, in order to increase memory bandwidth, memory modules are tied to the column busses. Each node includes dedicated two-level caches: a small first-level cache minimizes miss latency, while a very large second-level cache is intended to reduce bus traffic; splitting the first-level cache into an instruction and a data part is also advocated. The second-level cache is connected to both its row and column busses and is responsible for coherence maintenance by snooping on both busses. An appropriate invalidation snooping solution (an extended and modified write-invalidate protocol) is employed; the protocol issues at most twice as many bus operations on the multiple busses in order to acquire blocks, compared to single-bus systems.

The same topology is applied in the Aquarius multiprocessor [Carl90], but it differs from the Wisconsin Multicube in the memory distribution and the cache coherence protocol. Memory is completely distributed across the nodes, which also contain large caches. The coherence solution combines features of snooping and directory schemes: the snooping principle maintains coherence of the caches on individual busses, while a directory method is used to preserve coherence among the busses.

A quite different multiple-bus architecture is proposed in [Wils87]. Caches and busses are hierarchically organized in a multilevel tree-like structure. Lower-level caches are private and connected to the local busses, while caches on the higher levels are shared. Following the inclusion property, a higher-level cache can filter the invalidation actions towards the lower-level caches and restrict coherency actions only to the sections where they are necessary, as a part of an extended invalidation protocol for preserving coherence.

Another protocol proposed for a hierarchical two-level bus topology with clusters can be found in [Arch88]. Since a process allocation which results in tighter sharing on the lower levels and looser sharing on the higher levels is naturally expected, a snoopy write-update (modified Dragon) protocol is used inside the subsystems, and a write-invalidate (modified Berkeley) protocol is used on the global level, which accounts for the type of sharing on the various levels. For the same type of organization, a somewhat different coherence solution, which is independent of the network type, is proposed in [DeWa90]; that protocol incorporates selected distributed features of snoopy protocols and consists of an internal and a global portion.

A cluster architecture with memory modules distributed among the clusters has also been proposed; this approach is applied in the Encore GigaMax project. The inclusion property is applied there as well.

A hierarchical cluster architecture similar to the previous one also characterizes the Data Diffusion Machine [Hari89]. Very large processor caches implement a virtual shared memory. A hierarchical write-invalidate protocol is used for cache coherence maintenance, which fully supports data migration. Clusters are connected to the higher levels through data controllers; that is where a set-associative directory for the data blocks of the lower level resides. The controller also monitors the activities on the busses in a snooping manner.

Stanford's DASH distributed shared memory multiprocessor system is composed of common-bus multiprocessor nodes linked by a scalable interconnect of a general type [Leno90]. It is one of the first operational machines with a scalable cache-coherent architecture. Each node contains private caches and employs a snoopy scheme for cache coherence inside the node, while system-wide coherence is maintained using a directory-based protocol. There is no global directory, since it is partitioned and distributed across the nodes; the directories for memory blocks reside in the large local memory modules. Among other means for memory access optimization, the DASH supports a more relaxed release consistency model in hardware.

MIT's Alewife multiprocessor uses the LimitLESS protocol, which represents a hardware-based cache coherence method supported by a software mechanism [Chai91]. Directory entries implement only a limited number of hardware pointers, counting on the fact that this is sufficient in the vast majority of cases. Exceptional circumstances, when more pointers are needed, are handled in software: in those infrequent cases an interrupt is generated and a full-map directory for the block is emulated. A fast trap mechanism provides support for this feature.

So far, hardware and software solutions have mostly been developed independently of each other, each approach trying to solve the problem only in its own domain. However, we strongly believe that a very promising approach to attaining optimal performance is the complementary use of hardware and software means. We expect that this is a direction where real breakthroughs are likely to happen.
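The LimitLESS idea of backing a few hardware pointers with software can be sketched as below; the pointer count, the trap mechanism, and the data structures are illustrative assumptions and do not describe the actual Alewife implementation.

```c
#include <stdio.h>

#define HW_PTRS 4    /* hardware pointers per directory entry (assumed)      */
#define MAX_SW  64   /* software-emulated full map, bounded here for the demo */

struct dir_entry {
    int hw[HW_PTRS];   /* sharer ids kept in hardware                 */
    int hw_used;
    int sw[MAX_SW];    /* overflow sharers recorded by a trap handler */
    int sw_used;
};

/* Software handler: emulates a full-map entry for the rare blocks
 * whose sharer set overflows the hardware pointers.                */
static void overflow_trap(struct dir_entry *e, int sharer)
{
    if (e->sw_used < MAX_SW)
        e->sw[e->sw_used++] = sharer;
    printf("trap: sharer %d recorded in software (total sharers %d)\n",
           sharer, e->hw_used + e->sw_used);
}

/* Called on a read request from `sharer`; the common case stays in hardware. */
static void add_sharer(struct dir_entry *e, int sharer)
{
    if (e->hw_used < HW_PTRS)
        e->hw[e->hw_used++] = sharer;   /* fast path: pure hardware        */
    else
        overflow_trap(e, sharer);       /* infrequent case: interrupt CPU  */
}

int main(void)
{
    struct dir_entry e = {0};
    for (int p = 0; p < 6; p++)
        add_sharer(&e, p);              /* sharers 4 and 5 overflow        */
    return 0;
}
```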

7. Conclusion

This survey has tried to give a comprehensive overview of hardware-based solutions to the cache coherence problem in shared memory multiprocessors. Despite the considerable advancement of the field, it still represents a very active research area, and much of the work in developing, implementing, and evaluating the solutions is yet to be done, with significant prospective benefits. The great importance of the problem, and the strong impact of the applied solution on system performance, put an obligation on system architects and designers to carefully consider the aforementioned issues in their pursuit of more powerful and more efficient shared memory multiprocessors.

8. References

[Agar89] Agarwal A., Simoni R., Hennessy J., Horowitz M., "An Evaluation of Directory Schemes for Cache Coherence," Proceedings of the 15th ISCA, 1988.
[Arch85] Archibald J., Baer J.-L., "An Economical Solution to the Cache Coherence Problem," Proceedings of the 12th ISCA, 1985.
[Arch86] Archibald J., Baer J.-L., "Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model," ACM Transactions on Computer Systems, Vol. 4, No. 4, November 1986.
[Arch87] Archibald J., "The Cache Coherence Problem in Shared-Memory Multiprocessors," PhD Thesis, University of Washington, February 1987.
[Arch88] Archibald J., "A Cache Coherence Approach for Large Multiprocessor Systems," Proceedings of the Supercomputing Conference, 1988.
[Baer88] Baer J.-L., Wang W.-H., "On the Inclusion Properties for Multi-level Cache Hierarchies," Proceedings of the 15th ISCA, May 1988.
[Bita86] Bitar P., Despain A., "Multiprocessor Cache Synchronization: Issues, Innovations, Evolution," Proceedings of the 13th ISCA, June 1986.
[Carl90] Carlton M., Despain A., "Aquarius Project," IEEE Computer, Vol. 23, No. 6, June 1990.
[Cens78] Censier L., Feautrier P., "A New Solution to Coherence Problems in Multicache Systems," IEEE Transactions on Computers, Vol. C-27, No. 12, December 1978.
[Chai90] Chaiken D., Fields C., Kurihara K., Agarwal A., "Directory-Based Cache Coherence in Large-Scale Multiprocessors," IEEE Computer, Vol. 23, No. 6, June 1990.
[Chai91] Chaiken D., Kubiatowicz J., Agarwal A., "LimitLESS Directories: A Scalable Cache Coherence Scheme," Proceedings of the 4th ASPLOS, 1991.
[Dubo88] Dubois M., Scheurich C., Briggs F., "Synchronization, Coherence, and Event Ordering in Multiprocessors," IEEE Computer, Vol. 21, No. 2, February 1988.
[Egge89] Eggers S., Katz R., "Evaluating the Performance of Four Snooping Cache Coherency Protocols," Proceedings of the 16th ISCA, 1989.
[Egge89a] Eggers S., Katz R., "The Effect of Sharing on the Cache and Bus Performance of Parallel Programs," Proceedings of the 3rd ASPLOS, April 1989.
[Fran84] Frank S., "Tightly Coupled Multiprocessor System Speeds Memory-access Times," Electronics, Vol. 57, No. 1, January 1984.
[Good83] Goodman J., "Using Cache Memory to Reduce Processor-Memory Traffic," Proceedings of the 10th ISCA, 1983.
[Good88] Goodman J., Woest P., "The Wisconsin Multicube: A New Large-Scale Cache-Coherent Multiprocessor," Proceedings of the 15th ISCA, May 1988.
[Hari89] Haridi S., Hagersten E., "The Cache Coherence Protocol of the Data Diffusion Machine," Proceedings of PARLE '89, 1989.
[Jame90] James D., Laundrie A., Gjessing S., Sohi G., "Scalable Coherent Interface," IEEE Computer, Vol. 23, No. 6, June 1990.
[Karl86] Karlin A., Manasse M., Rudolph L., Sleator D., "Competitive Snoopy Caching," Proceedings of the 27th Annual Symposium on Foundations of Computer Science, 1986.
[Katz85] Katz R., Eggers S., Wood D., Perkins C., Sheldon R., "Implementing a Cache Consistency Protocol," Proceedings of the 12th ISCA, June 1985.
[Lee90] Lee J., Ramachandran U., "Synchronization with Multiprocessor Caches," Proceedings of the 17th ISCA, June 1990.
[Leno90] Lenoski D., Laudon J., Gharachorloo K., Gupta A., Hennessy J., "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor," Proceedings of the 17th ISCA, June 1990.
[McCr84] McCreight E., "The Dragon Computer System: An Early Overview," Proceedings of the NATO Advanced Study Institute on Microarchitecture of VLSI Computers, July 1984.
[Mend89] Mendelson A., Pradhan D., "A Single Copy Data Coherence Scheme for Multiprocessor Systems," 1989.
[OKra90] O'Krafka B., Newton A., "An Empirical Evaluation of Two Memory-Efficient Directory Methods," Proceedings of the 17th ISCA, June 1990.
[Papa84] Papamarcos M., Patel J., "A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories," Proceedings of the 11th ISCA, June 1984.
[Pret90] Prete C., "A New Solution of Coherence Protocol for Tightly Coupled Multiprocessor Systems," Microprocessing and Microprogramming, Vol. 30, 1990.
[Rudo84] Rudolph L., Segall Z., "Dynamic Decentralized Cache Schemes for MIMD Parallel Processors," Proceedings of the 11th ISCA, June 1984.
[Sten90] Stenstrom P., "A Survey of Cache Coherence Schemes for Multiprocessors," IEEE Computer, Vol. 23, No. 6, June 1990.
[Swea86] Sweazey P., Smith A. J., "A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus," Proceedings of the 13th ISCA, June 1986.
[Tang76] Tang C., "Cache System Design in the Tightly Coupled Multiprocessor System," Proceedings of the AFIPS National Computer Conference, 1976.
[Thac88] Thacker C., Stewart L., Satterthwaite E., "Firefly: A Multiprocessor Workstation," IEEE Transactions on Computers, Vol. 37, No. 8, August 1988.
[Thap90] Thapar M., Delagi B., "Stanford Distributed-Directory Protocol," IEEE Computer, Vol. 23, No. 6, June 1990.
[Toma92] Tomasevic M., Milutinovic V., "A Simulation Study of Snoopy Cache Coherence Protocols," Proceedings of the 25th HICSS, January 1992.
[Wang89] Wang W.-H., Baer J.-L., Levy H., "Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy," Proceedings of the 16th ISCA, June 1989.
[Webe89] Weber W.-D., Gupta A., "Analysis of Cache Invalidation Patterns in Multiprocessors," Proceedings of the 3rd ASPLOS, April 1989.
[Wils87] Wilson A., "Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors," Proceedings of the 14th ISCA, June 1987.
[Yen85] Yen D., Yen W., Fu J., "Data Coherence Problem in a Multicache System," IEEE Transactions on Computers, Vol. C-34, No. 1, January 1985.

9. Suggestions for further reading

Here we provide an additional list of references, not mentioned in the paper because of space restrictions, on a number of different subjects related to the general topic of cache coherence in multiprocessor systems. Doubtlessly, this topic deserves much more time and space, and this list is intended to give a deeper insight into the cache coherence problem in shared memory multiprocessors. The papers cover a broad range of topics in the general area of cache coherence: software schemes, parallel program characteristics, analytic and simulation models of protocols, performance evaluation and comparison, etc.

[Adve91] Adve S., Adve V., Hill M., Vernon M., "Comparison of Hardware and Software Cache Coherence Schemes," Proceedings of the 18th ISCA, 1991.
[Cheo89] Cheong H., Veidenbaum A., "A Version Control Approach to Cache Coherence," Proceedings of the International Conference on Supercomputing, 1989.
[Cher91] Cheriton D., Goosen H., Boyle P., "Paradigm: A Highly Scalable Shared-Memory Multicomputer Architecture," IEEE Computer, February 1991.
[Cytr88] Cytron R., et al., "Automatic Management of Programmable Caches," Proceedings of the 1988 ICPP, 1988.
[Dubo82] Dubois M., Briggs F., "Effects of Cache Coherency in Multiprocessors," IEEE Transactions on Computers, Vol. C-31, No. 11, November 1982.
[Egge88] Eggers S., Katz R., "A Characterization of Sharing in Parallel Programs and its Application to Coherency Protocol Evaluation," Proceedings of the 15th ISCA, May 1988.
[Egge89b] Eggers S., "Simulation Analysis of Data Sharing in Shared Memory Multiprocessors," PhD Thesis, Report No. UCB/CSD 89/501, University of California, Berkeley, 1989.
[Good87] Goodman J., "Coherency for Multiprocessor Virtual Address Caches," Proceedings of the 2nd ASPLOS, October 1987.
[Min89] Min S., Baer J.-L., "A Timestamp-Based Cache Coherence Scheme," Proceedings of the 1989 ICPP, 1989.
[Owic89] Owicki S., Agarwal A., "Evaluating the Performance of Software Cache Coherence," Proceedings of the 3rd ASPLOS, April 1989.
[Sche87] Scheurich C., Dubois M., "Correct Memory Operation of Cache-Based Multiprocessors," Proceedings of the 14th ISCA, June 1987.
[Site88] Sites R., Agarwal A., "Multiprocessor Cache Analysis Using ATUM," Proceedings of the 15th ISCA, May 1988.
[Smit82] Smith A. J., "Cache Memories," ACM Computing Surveys, Vol. 14, No. 3, September 1982.
[Smit85] Smith A. J., "CPU Cache Consistency with Software Support and Using One-Time Identifiers," Proceedings of the Pacific Computer Communications Symposium, 1985.
[Sten89] Stenstrom P., "A Cache Consistency Protocol for Multiprocessors with Multistage Networks," Proceedings of the 16th ISCA, June 1989.
[Tart92] Tartalja I., Milutinovic V., "An Approach to Software Cache Consistency Maintenance Based on Conditional Invalidation," Proceedings of the 25th HICSS, January 1992.
[Thak90] Thakkar S., Dubois M., Laundrie A., Sohi G., "Scalable Shared-Memory Multiprocessor Architectures," IEEE Computer, Vol. 23, No. 6, June 1990.
[Vern86] Vernon M., Holliday M., "Performance Analysis of Multiprocessor Cache Consistency Protocols Using Generalized Timed Petri Nets," Proceedings of Performance '86 and ACM SIGMETRICS 1986, May 1986.
[Vern88] Vernon M., Lazowska E., Zahorjan J., "An Accurate and Efficient Performance Analysis Technique for Multiprocessor Snooping Cache-Consistency Protocols," Proceedings of the 15th ISCA, May 1988.
[Wood89] Woodbury P., et al., "Shared Memory Multiprocessors: The Right Approach to Parallel Processing," Proceedings of COMPCON Spring '89, 1989.
[Yeh83] Yeh C., Patel J., Davidson E., "Shared Cache for Multiple-Stream Computer Systems," IEEE Transactions on Computers, Vol. C-32, No. 1, January 1983.

