Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches

Anant Agarwal and Steven D. Pudar
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139
Abstract
Direct-mapped caches are a popular design choice for high-performance processors; unfortunately, direct-mapped caches suffer systematic interference misses when more than one address maps into the same cache set. This paper describes the design of column-associative caches, which minimize the conflicts that arise in direct-mapped accesses by allowing conflicting addresses to dynamically choose alternate hashing functions, so that most of the conflicting data can reside in the cache. At the same time, however, the critical hit access path is unchanged. The key to implementing this scheme efficiently is the addition of a rehash bit to each cache set, which indicates whether that set stores data that is referenced by an alternate hashing function. When multiple addresses map into the same location, these rehashed locations are preferentially replaced. Using trace-driven simulations and an analytical model, we demonstrate that a column-associative cache removes virtually all interference misses for large caches, without altering the critical hit access time.

Figure 1: Indexing into a direct-mapped cache using bit-selection hashing. The address is divided into a tag field and an index field; the index selects a cache set.

1 Introduction

The cache is an important component of the memory system of workstations and mainframe computers, and its performance is often a critical factor in the overall performance of the system. The advent of RISC processors and VLSI technology has driven down processor cycle times and made frequent references to main memory unacceptable. Caches are characterized by several parameters, such as their size, their replacement algorithm, their block size, and their degree of associativity [1]. For cache accesses, a typical address a is divided into at least two fields: the tag field (typically the high-order bits) and the index field (the low-order bits), as shown in Figure 1. The index field is used to reference one of the sets, and the tag field is compared to the tags of the data blocks within that set. If the tag field of the address matches one of the tag fields of the referenced set, then we have a hit, and the data can be obtained from the block that exhibited the hit.¹ In a d-way set-associative cache, each set contains d distinct blocks of data accessed by addresses with common index fields but different tags. When the degree of associativity is reduced to one, each set can hold no more than one block of data; this configuration is called a direct-mapped cache.

¹In most caches, more than one data word can reside in a data block. In this case, an offset is the third and lowest-order field in the address, and it is used to select the appropriate data word.

For a cache of given size, the choice of its degree of associativity influences many performance parameters, such as the silicon area (or, alternatively, the number of chips) required to implement the cache, the cache access time, and the miss rate. Because a direct-mapped cache allows only one data block to reside in the cache set that is directly specified by the address index field, its miss rate (the ratio of misses to total references) tends to be worse than that of a set-associative cache of the same total size. However, the higher miss rate of direct-mapped caches is mitigated by their smaller hit access time [2, 3]. A set-associative cache of the same total size always displays a higher hit access time, because an associative search of a set is required during each reference, followed by a multiplexing of the appropriate data word to the processor. Furthermore, direct-mapped caches are simpler and easier to design, and they require less area. Overall, direct-mapped caches are often the most economical choice for use in workstations, where cost-performance is the most important criterion.

1.1 The Problem

Unfortunately, the large number of interference misses that occur in direct-mapped caches is still a major problem. An interference miss (also known as a conflict miss) occurs when two addresses map into the same cache set in a direct-mapped cache, as shown in Figure 1.
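To make the mapping concrete, the sketch below shows bit-selection indexing in code. This is our own minimal illustration, not something from the paper; the field widths (16-byte blocks, 1024 sets) are arbitrary assumptions.

```python
# Minimal sketch of bit-selection indexing (assumptions: byte addresses,
# 16-byte blocks, 1024 sets; widths and names are illustrative only).

OFFSET_BITS = 4          # log2(block size): selects the word within a block
INDEX_BITS = 10          # log2(number of sets)

def fields(addr):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)   # this is b[a]
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# Two addresses that differ only in their high-order (tag) bits collide:
# same index field b[a], different tags, hence an interference miss.
ai, aj = 0x0012340, 0x0712340
assert fields(ai)[1] == fields(aj)[1] and fields(ai)[0] != fields(aj)[0]
```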


Consider referencing a cache with two addresses, ai and aj, that differ only in some of the higher-order bits, as often occurs in multiprogramming environments. In this case, the addresses will have different tags but identical index fields, so they will reference the same set. If we denote the set that is selected by choosing the low-order bits of an address a as b[a], then we have b[ai] = b[aj] for conflicting addresses. (The name b comes from the bit-selection operation performed on the bits to obtain the index.)

Assume the following reference pattern: ai aj ai aj ai aj .... In a direct-mapped cache, the second reference, aj, will miss and force the data indexed by ai out of the set; the third reference, ai, will then also result in an interference miss, because the data indexed by aj now occupies the selected cache block, and so on. The percentage of misses that are due to conflicts varies widely among different applications, but it is often a substantial portion of the overall miss rate. A set-associative cache will not suffer a miss if the program issues the above sequence of references, because the data referenced by ai and aj can co-reside in a set. A more desirable cache design would therefore reduce the interference miss rate to the same extent as a set-associative cache while retaining the hit access time of a direct-mapped cache. We believe these interference misses can be largely eliminated by implementing control logic which makes better use of the cache area. The challenge, then, is determining a simple, area-efficient cache control algorithm that reduces the number of interference misses and boosts performance without increasing the degree of associativity.

1.2 Contributions of This Paper

This paper presents the design of a column-associative cache that resolves conflicts by allowing alternate hashing functions, so that most of the conflicting data can reside in the cache, while leaving the critical hit access path unchanged. Using trace-driven simulations, we demonstrate that its miss rate is much better than that of Jouppi's victim cache [4] and the hash-rehash cache of Agarwal, Horowitz, and Hennessy [5], and virtually the same as that of a two-way set-associative cache. Furthermore, because the area available to resolve conflicts in the column-associative cache increases with primary cache size, it resolves virtually all conflicts in large caches. This decrease in the interference miss rate is achieved not by set associativity but by exploiting temporal locality to make more efficient use of the given cache area, a notion called column associativity. With proper design, the few additional cycles required to execute the algorithm in case of a miss are balanced by the decrease in the miss rate due to fewer conflicts. We also develop and validate an analytical model for this cache.

The rest of this paper is organized as follows. The next section discusses other efforts with similar goals. Section 3 presents the column-associative cache, and Section 4 develops an analytical model for it. Section 5 presents the results of trace-driven simulations comparing the performance of several cache designs, and Section 6 concludes the paper.

2 Previous Work

Several schemes have been proposed for reducing the number of interference misses. A general approach to improving direct-mapped cache access is Jouppi's victim cache [4]. A victim cache is a small, fully-associative cache that provides some extra cache lines for data removed from the direct-mapped cache due to misses. For a reference stream of conflicting addresses, such as ai aj ai aj ..., the second reference to ai will not require accessing main memory, because the data can be found in the victim cache. By maintaining direct-mapped access for first-time hits, the victim cache maintains the critical hit access path, but it does require a second, and possibly a third, cycle to fetch a conflicting datum: one cycle is needed to check the primary cache, the second to check the victim cache, and a possible third to store the datum into the primary cache. Because of its fixed size relative to the primary direct-mapped cache, this scheme requires a sizable victim cache for adequate performance, since it must store all conflicting data blocks; both our results and those presented by Jouppi (see Figure 3-6 in [4]) show that it is not very effective at resolving conflicts for large primary caches.

The hash-rehash cache of Agarwal, Horowitz, and Hennessy [5] had similar goals. It maintains the critical hit access path of the direct-mapped cache; however, it requires two or more accesses to fetch a conflicting datum, and in Section 3.1 we demonstrate that it has one serious drawback. The technique introduced in Section 3 removes this drawback and largely eliminates interference misses by implementing slightly more complex control logic, which results in significantly better use of the cache area.

Kessler et al. [7] propose inexpensive implementations of set-associative caches that place the multiple blocks of a set in sequential locations of cache memory. Tag checks, done serially, avoid the wide datapath requirements of conventional set-associative caches. The principal focus of this study was a reduction in implementation cost, and the performance (measured in terms of average access time) of the scheme could often be worse than a direct-mapped cache for long strings of consecutive addresses: a long sequential reference stream of length equal to the cache size would fit into a direct-mapped cache, and subsequent references to any of these locations would result in first-time hits, whereas in a d-way set-associative implementation of this scheme, only 1/d of the references would succeed in the first access.

A similar problem exists in the MRU scheme proposed by So and Rechtschaffen [8]. The MRU scheme is a means for speeding up set-associative cache accesses. It maintains a few bits with each cache set indicating the most recently used block in the set, and an access to a given set immediately reads out its MRU block, betting on the likelihood that it is the desired block. If it isn't, then an associative search accompanies a second access. (A two-way set-associative cache, however, does not require an associative search on the second access.) Clearly, only 1/d of the references in a long sequential address stream would result in first-time hits into a d-way set-associative cache using this scheme.

The scheme in [6] is proposed for instruction caches; it uses two instruction buffers (of size equal to a cache line) between the instruction cache and the instruction register, and an instruction encoding that makes it easy to detect the presence of branch instructions in the buffers. Like the column-associative cache, these schemes do not affect the critical hit access time.

3 Column-Associative Caches

The fundamental idea behind a column-associative cache is to resolve conflicts by dynamically choosing different locations, accessed by different hashing functions, in which conflicting data can reside. The remainder of our discussion proceeds in two steps. First, we describe a basic system that uses multiple hashing functions and discuss its drawbacks; then, we add rehash bits to this design to alleviate its problems.

3.1 Multiple Hashing Functions

Like the hash-rehash cache in [5], column-associative caches use two (or possibly more) distinct hashing functions, h1 and h2, to access the cache. When presented with conflicting addresses (that is, h1[ai] = h1[aj]), a different hashing function is dynamically applied in order to place or locate the data in a different set; rehashing aj with h2 resolves the conflict with ai, because for conflicting addresses h1[ai] differs from h2[aj]. For simplicity and for speed, the first-time access is performed with bit selection (that is, h1 = b), and the second hashing function is often bit flipping (that is, h2 = f), which we define as bit selection with the highest-order bit of the index inverted. If b[a] = 010, then f[a] = 110, as illustrated in Figure 3.

Figure 3: Indexing into a cache by bit selection and by bit flipping.

If h1[ai] indexes to valid data, a first-time hit occurs; note that the hit access time of a first-time hit remains unchanged. Otherwise, h2[ai] is used to access the cache. If a second-time hit occurs, the data is retrieved and swapped with the data in the first location, so that the next access will likely result in a first-time hit. If the second access also misses, then the data is retrieved from main memory, placed in the cache line indexed by h2[ai], and swapped with the data in the first location.

Figure 2 compares the column-associative cache with a two-way set-associative cache of equal size; the conflict b[ai] = b[aj] is resolved by both schemes. For addresses whose indexes are the same and which thus reference the same set, the tags are compared in order to determine whether an address should access the data block, and the set-associative cache resolves the conflict statically by referencing another location within the same set. The column-associative cache, on the other hand, is direct-mapped, and conflicts are resolved not within a set but within the entire cache, which can be thought of as a column of sets, hence the name column associativity. Using two or more hashing functions thus mimics set associativity. Column associativity could obviously be extended to emulate degrees of associativity higher than two, but it is likely that the complexity of implementing such an algorithm would add little to the performance and might even degrade it.

Figure 2: Comparison of column-associative and two-way set-associative caches of equal size.

Because hits are much more frequent than misses, the extra cycles required to implement the column-associative algorithm on a miss can easily be balanced by the small improvement in hit access time on every hit, resulting in a smaller average memory access time when compared to a two-way set-associative cache, which suffers due to the need for an associative search and for the logic to maintain a least-recently-used replacement policy. By comparison, a victim-cache implementation requires an entirely separate, fully-associative cache to store the conflicting data; not only does the victim cache consume extra area, but it can also be quite slow. The column-associative implementation instead uses sets within the cache itself to store conflicting data. Of course, storing conflicting data within the cache, instead of in a separate victim cache, very likely results in the loss of useful data, but this effect (henceforth referred to as clobbering) can be minimized, as discussed in Section 3.2. In addition, as long as the control logic used to implement column associativity is simple and fast, the benefits of direct-mapped caches over set-associative caches (as discussed in Section 1), especially the lower hit access time, are maintained.

The use of bit flipping as a second hashing function results in a potential problem. Consider two addresses, ai and ax, which differ only in the high-order bit of the index field (that is, b[ai] = f[ax] and f[ai] = b[ax]). These two addresses are distinct; however, their tag fields are identical, so a rehash access with f[ai] results in a hit on a data block that should only be accessed by b[ax]. This is unacceptable, because a data block must have a one-to-one correspondence with a unique address. This suggests a simple solution: appending the high-order bit of the index field to the tag, as illustrated in Figure 4. The rehash with f[ai] will then correctly fail, because the data block is once again referenced by a unique address. This technique is necessary when bit flipping is implemented, and it is assumed to be in place whenever bit flipping is used.

Figure 4: Appending the high-order bit of the index to the tag.
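The two hashing functions and the tag-extension fix lend themselves to a few lines of code. The sketch below is our own illustration (the three-bit index matches the eight-set example of Figure 3; nothing here is taken verbatim from the paper):

```python
INDEX_BITS = 3                       # eight sets, as in the example of Figure 3
TOP = 1 << (INDEX_BITS - 1)          # mask for the highest-order index bit

def b(a):
    """h1: bit selection, the low-order index bits of the address."""
    return a & ((1 << INDEX_BITS) - 1)

def f(a):
    """h2: bit flipping, b[a] with its high-order bit inverted."""
    return b(a) ^ TOP

assert f(0b010) == 0b110             # the example in the text

def stored_tag(a):
    """Tag extended with the high-order index bit (the fix of Figure 4),
    so a rehashed block cannot falsely hit for the address that selects
    this set directly."""
    return a >> (INDEX_BITS - 1)
```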

To illustrate the operations more clearly, the hash-rehash algorithm has been expressed as the decision tree in Figure 5, which is simply a translation of the verbal description of the algorithm into a tree structure. Table 1 explains the mnemonics used in this decision tree and in the others introduced in this paper, and it also lists the number of cycles required to complete each action, which is necessary for the calculation of average access time. For example, the three cycles indicated in the swap branch of Figure 5 result from one cycle for the initial cache access, one cycle for the rehash access, and one cycle wasted during the swap.

Figure 5: Decision tree for the hash-rehash algorithm.

Table 1: Decision tree mnemonics and cycle times for each action.

  mnemonic   action                                        cycles
  b[a]       bit-selection access                          1
  f[a]       bit-flipping access                           1
  swap       swap data in sets accessed by b[a] and f[a]   1 or 2
  clobber1   get data from memory, place in set b[a]       M
  clobber2   get data from memory, place in set f[a]       M
  Rbit=1?    check if set b[a] is a rehashed location      0

For all decision trees in this paper, we assume that a swap adds only one cycle to the execution time: given an extra buffer for the cache, the swap need not involve the processor, which may be able to do other useful work while waiting for the cache to become available again. Without such support, the swap requires an additional two cycles to complete, so we provide access time results for both one- and two-cycle swaps, as can be seen in Section 5. The design requirements for accomplishing a swap in two cycles are discussed in Section A of the appendix.

Unfortunately, the hash-rehash cache has a serious drawback. The source of its problems is that a rehash is attempted after every first-time miss, even when the primary location holds an inactive block; the rehash then results in a second-time miss and can clobber potentially useful data in the rehashed location, which often reduces its performance to that of a direct-mapped cache.

To see this, consider the following reference pattern: ai aj ax aj ax aj ax ..., where b[ai] = b[aj] and f[ai] = b[ax]. This situation is illustrated in Figure 6. After the first two references, both the hash-rehash and the column-associative algorithms will have the data referenced by ai and aj in the non-rehashed and rehashed locations, respectively. When the next address, ax, is encountered, both algorithms attempt to access the set b[ax], which contains the rehashed data i. On this first-time miss, the hash-rehash algorithm next tries to access f[ax], which results in a second-time miss and the clobbering of the data j. This pattern continues as long as aj and ax alternate: the data referenced by one of them is clobbered, while the inactive data block i is swapped back and forth but never replaced. We will refer to this negative effect as secondary thrashing. The following section describes how the use of a rehash bit lessens the effects of these limitations.

Figure 6: The potential for secondary thrashing in a reference stream of the form ai aj ax aj ax aj ax ..., where the addresses ai and aj map into the same cache location with bit selection. Different fonts are used to indicate different index fields and tags.

3.2 Rehash Bits

The key to implementing column associativity effectively is inhibiting a rehash access if the location reached by the first-time access itself contains a rehashed data block. This idea can be implemented as follows. Every cache set contains an extra bit which indicates whether the set is a rehashed location, that is, whether the data in the set is indexed by f[a]. If a first-time access misses, the rehashed-location bit (or rehash bit, for short) of that set is checked (Rbit=1?). If it has been set to one, then no rehash access is attempted: the data retrieved from memory immediately replaces the rehashed data in that location, and the rehash bit is reset to zero to indicate that the data in this set is to be indexed by b[a] in the future. On the other hand, if the rehash bit is already zero, then upon a first-time miss the rehash access proceeds as described in Section 3.1; note that if a second-time miss then occurs, the set whose data is replaced is again a rehashed location. At start-up (or after a cache flush), all of the empty cache locations should have their rehash bits set to one.

This algorithm, which is illustrated as a decision tree in Figure 7, is similar to that of the hash-rehash cache, but the key difference lies in the fact that when a cache set must be replaced, a rehashed location is always chosen, immediately if possible. The use of the rehash bit helps utilize cache area more efficiently because it immediately indicates whether a location is rehashed and should be replaced in preference over a non-rehashed location.
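The access algorithm just described can be summarized in executable pseudocode. The sketch below is our reading of the Figure 7 decision tree, with Python standing in for the cache controller; the data structure and names are ours, and the returned cycle counts are those of Table 1 with a one-cycle swap.

```python
from dataclasses import dataclass

# Our sketch of the column-associative access algorithm (Figure 7).
# 'a' is a block address (offset bits already stripped); tags are stored
# with the high-order index bit appended, as bit flipping requires.

M = 20                                   # miss penalty in cycles
INDEX_BITS = 10                          # S = 1024 sets, an arbitrary choice
TOP = 1 << (INDEX_BITS - 1)

@dataclass
class Line:
    tag: int                             # extended tag (Figure 4)
    rehash: bool                         # rehash bit (Section 3.2)

def b(a): return a & ((1 << INDEX_BITS) - 1)       # bit selection, h1
def f(a): return b(a) ^ TOP                        # bit flipping, h2
def xtag(a): return a >> (INDEX_BITS - 1)          # tag + high index bit

def access(cache, a):
    """Perform one reference; return the cycles spent (Table 1 costs).
    'cache' maps a set index to a Line; empty sets behave as rehashed
    locations, matching the start-up convention."""
    s1 = b(a)
    one = cache.get(s1)
    if one and one.tag == xtag(a):
        return 1                                   # first-time hit
    if one is None or one.rehash:                  # Rbit=1: inhibit rehash
        cache[s1] = Line(xtag(a), rehash=False)    # clobber1
        return M + 1
    two = cache.get(f(a))
    if two and two.tag == xtag(a):                 # second-time hit
        cache[s1], cache[f(a)] = two, one          # swap MRU data into b[a]
        cycles = 3
    else:                                          # second-time miss
        cache[s1], cache[f(a)] = Line(xtag(a), rehash=False), one  # clobber2 + swap
        cycles = M + 3
    cache[s1].rehash, cache[f(a)].rehash = False, True
    return cycles
```

A trace-driven simulator then simply sums access() over the reference stream while tallying R, H1, H2, and R2, the counters used for the metrics of Section 5.1.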

The reason that this algorithm can correctly replace a location whose rehash bit is set immediately after a first-time miss is that bit flipping is used as the second hashing function. If ai accesses a location b[ai] whose rehash bit is set to one, then there are only two possibilities: either (1) the accessed location is an empty location from start-up, or (2) there exists a non-rehashed location at f[ai] (that is, b[ax], where ax is an address which maps into the same location with bit flipping, f[ax] = b[ai]) which previously encountered a conflict and placed the displaced data in its rehashed location, b[ai].

Of course, it must also be proven that a third possibility cannot arise: namely, that b[ai] is a rehashed location while the data referenced by ai simultaneously resides in f[ai]. A first-time access to b[ai] would then automatically clobber the rehashed data and fetch the datum from memory, even though it is already present in f[ai]. Consider the actions taken by the algorithm when one of the two conditions precedes the other. If the data referenced by ai already resides in f[ai] due to a conflict, then the rehash bit of b[ai] must be zero, because b[ai] then holds the non-rehashed block of that conflict. The only way to change this bit is for b[ai] to be used as a rehashed location in order to resolve a different conflict; but because bit flipping is the rehashing function, the only set whose conflicts are rehashed into b[ai] is f[ai] itself, and a first-time miss at f[ai], whose rehash bit is set, replaces the data referenced by ai rather than rehashing into b[ai]. It is therefore clear that the two conditions for this third possibility can never occur simultaneously, so the desired action, immediate replacement, is always correct. This important property could not be utilized in the column-associative algorithm if bit flipping were not the second hashing function or if more than two hashing functions were included.

In addition to limiting rehash accesses and clobbering, the column-associative algorithm attempts to exploit temporal locality by swapping the most recently accessed data into the non-rehashed location. Furthermore, the rehash bits in the column-associative cache eliminate secondary thrashing: referring to the reference stream of Figure 6, the third reference accesses b[ax], finds the rehash bit set to one, and immediately replaces the inactive data i with x. Like the hash-rehash cache, the column-associative cache still thrashes if three or more conflicting addresses alternate, as in ai aj ax ai aj ax ai ..., but this case is much less probable than two alternating addresses.

Figure 7: Decision tree for a column-associative cache.

4 A Simple Analytical Model for Column-Associative Caches

We have developed a simple analytical model for the column-associative cache that predicts the percentage of interference misses removed from a direct-mapped cache using only one parameter measured from an address trace: the size of the program's working set. The model estimates the percentage of interference misses removed by computing the percentage of cache block conflicts removed by the rehash algorithm, and it builds on the self-interference component of the direct-mapped cache model of Agarwal, Horowitz, and Hennessy [9]. A detailed derivation appears in [12]; this section summarizes the major results. Because the behavior is captured in a simple, closed-form expression, the model yields valuable insights into the behavior of the column-associative cache, and validations against empirically derived cache miss rates suggest that its predictions are fairly accurate as well.

Like the self-interference model in [9], our model assumes that blocks have a uniform probability of mapping to any cache set and that the mappings for different blocks are independent of each other; the same assumption is made for the rehash accesses. This assumption is commonly made in cache modeling studies [10, 11, 9]. Although it makes such models generally overestimate miss rates, its effect is less severe when we are interested in the ratios of the numbers of conflicting blocks in direct-mapped and column-associative caches.

The percentage reduction in cache block conflicts in the column-associative cache is captured by two parameters: S and u. The parameter S represents the number of cache sets; the product of S and the block size yields the cache size. The parameter u denotes the working-set size of the program, the set of distinct blocks the program accesses within some interval of time, and must be measured from an address trace.

Blocks are said to conflict when multiple blocks from the working set of a program map to a given cache set. In a column-associative cache, conflicting blocks are blocks that conflict even after a rehash is attempted. (Section 5.1 provides further discussion of the notion of conflicts.) Because blocks are assumed to map with equal likelihood to any cache set, the distribution of the number of blocks in a cache set is binomial. Expressions are derived in [12] for the number of conflicting blocks in direct-mapped and column-associative caches in terms of P(d), the probability that d program blocks (out of a total of u) map to a given cache set:

  P(d) = C(u,d) (1/S)^d (1 - 1/S)^(u-d)    (1)

Let c_d denote the number of conflicting blocks in a direct-mapped cache, and c_cac the corresponding number of conflicting blocks in a column-associative cache. Then

  c_d = u - S P(1)
  c_cac = u - S P(1) - S P(2) (1 + P(0) - P(1) - P(2))
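These expressions are straightforward to evaluate numerically. The sketch below is our own illustration of Equation 1 and the two conflict counts; the sample values of u and S are arbitrary:

```python
from math import comb

def P(d, u, S):
    """Equation 1: probability that exactly d of the u working-set blocks
    map to a given set (binomial with p = 1/S)."""
    return comb(u, d) * (1 / S) ** d * (1 - 1 / S) ** (u - d)

def conflicts(u, S):
    """Conflicting blocks (c_d, c_cac) in a direct-mapped and a
    column-associative cache of S sets, for a working set of u blocks."""
    c_d = u - S * P(1, u, S)
    c_cac = c_d - S * P(2, u, S) * (1 + P(0, u, S) - P(1, u, S) - P(2, u, S))
    return c_d, c_cac

u, S = 400, 4096                  # arbitrary example: cache ~10x working set
c_d, c_cac = conflicts(u, S)
print((c_d - c_cac) / c_d)        # ~0.86; the first-order value 1 - 2u/S = 0.805
```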

We estimate the percentage of interference misses removed by the percentage reduction in the number of conflicting blocks:

  (c_d - c_cac) / c_d = S P(2) (1 + P(0) - P(1) - P(2)) / (u - S P(1))    (2)

It is instructive to take the first-order approximation of the expression in Equation 2 after substituting for P(d) from Equation 1 and simplifying. The approximation is valid when S >> u and u >> 1, which allows us to use (1 - 1/S)^(u-d) ≈ (1 - u/S). Proceeding along these lines, we obtain

  (c_d - c_cac) / c_d ≈ 1 - 2u/S    (3)

It is easy to see from the above equation that the percentage of conflicts removed by rehashing approaches unity as the cache size is increased; roughly 50% of the conflicts are removed when the cache is four times larger than the working set of the program.

To demonstrate the accuracy of the model, we plot in Figure 13 the measured values of the average percentages of interference misses removed and the values obtained using Equation 2 for our traces. The analytical model uses only one parameter, the working-set size u, measured from each trace; Table 3 shows the working-set sizes for each of our traces. Both the model and the simulations use a block size of 16 bytes. The predictions for the individual traces are also fairly accurate, as displayed in Figures 8 and 9, and our validation experiments indicate that the model is a good approximation.

5 Results

This section presents the data obtained through simulation of the various caches and an analysis of these results. First, the metrics which have been used to evaluate the performance of the caches must be described; the average memory access times are then derived from the decision trees of Section 3.

5.1 Cache Performance Metrics

We use three cache performance metrics in our results: the cache miss rate, the percentage of interference misses removed, and the average memory access time.

The miss rate is the ratio of the number of misses to the total number of references. An interference miss is defined as a miss that results when a block that was previously displaced from the cache is subsequently referenced; in a single-processor environment, the total number of misses minus the misses due to first-time references is the number of interference misses. The compulsory miss rate is the ratio of unique references to total references for a trace. We note that our interference metric measures the sum of the intrinsic and extrinsic interference misses in the classification of Agarwal, Horowitz, and Hennessy [9], and the sum of the capacity, conflict, and context-switching misses in the terminology of Hill and Smith [13].

The percentage of interference misses removed is the percentage by which the number of interference misses in the cache under consideration is reduced over those in a direct-mapped cache.² This metric is particularly useful for determining the success of a particular scheme, because all cache implementations must share the same compulsory (first-time) miss rate for a given reference stream, but they may have different interference miss rates. It is calculated as

  (direct miss rate - miss rate) / (direct miss rate - compulsory miss rate) x 100%

where the miss rate is that of the particular cache design and the direct miss rate is that of a direct-mapped cache of equal size.

²A similar parameter was used by Jouppi [4] as a useful measure of the performance of victim caches.

Finally, for a given address trace and cache size, the average memory access time is the average number of cycles required to complete one reference in a particular address stream. This metric is useful in assessing the performance of a specific caching scheme because, although a particular cache design may demonstrate a lower miss rate than a direct-mapped cache, it may do so at the expense of the hit access time.³ Let the cache access time for a hit be one cycle, and let M represent the number of cycles required to service a miss from main memory (in our simulations, M = 20). The simulator described in the next section measures R, H1, and H2 for each of the cache types, where R is the total number of references in the trace, H1 is the total number of hits on a first-time access, and H2 is the total number of hits on a second-time access. For column-associative caches, we also need an additional parameter, R2, the total number of second-time accesses. (Recall that second-time accesses are attempted only when the rehash bit is zero.)

³The cycles per instruction (CPI) assuming single-cycle instruction execution can be calculated easily from the average access time. For a unified instruction and data cache with a single-cycle access time, the CPI with a 100% hit rate is (1 + l), where l is the fraction of instructions that are loads or stores. In the presence of cache misses, the CPI becomes (1 + l) t_ave.

For direct-mapped caches, the access time is one cycle for hits and (M + 1) cycles for misses. Thus

  t_ave = (1/R) [H1 + (M + 1)(R - H1)]

For hash-rehash caches, the access time is one for first-time hits and three for the H2 hits during a rehash attempt (every first-time miss is followed by a rehash), while rehash attempts that miss suffer a penalty of (M + 3) cycles. Thus

  t_ave = (1/R) [H1 + 3 H2 + (M + 3)(R - H1 - H2)]

For column-associative caches, the access time is one for first-time hits and three for rehash hits; if a rehash is not attempted, (M + 1) cycles are spent, and a rehash attempt that misses costs (M + 3) cycles. Thus

  t_ave = (1/R) [H1 + 3 H2 + (M + 1)(R - H1 - R2) + (M + 3)(R2 - H2)]
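Given the counters R, H1, H2, and R2 from a simulator, these expressions reduce to a few lines of code (our illustration; M = 20 matches the simulations):

```python
def t_direct(R, H1, M=20):
    return (H1 + (M + 1) * (R - H1)) / R

def t_hash_rehash(R, H1, H2, M=20):
    # every first-time miss triggers a rehash: hits cost 3, misses M + 3
    return (H1 + 3 * H2 + (M + 3) * (R - H1 - H2)) / R

def t_column_assoc(R, H1, H2, R2, M=20):
    # R2 counts rehash attempts (first-time misses with rehash bit zero)
    return (H1 + 3 * H2 + (M + 1) * (R - H1 - R2) + (M + 3) * (R2 - H2)) / R

def cpi(l, t_ave):
    # footnote 3: unified single-cycle cache, l = load/store fraction
    return (1 + l) * t_ave
```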
5.2 Simulator and Trace Descriptions

We wrote trace-driven simulators for direct-mapped, set-associative, victim, hash-rehash, and column-associative caches. All caches are assumed to be combined instruction and data caches. Multiprogrammed simulations assume that a process identifier is associated with each reference to distinguish between the data of different processes.

The traces used in this study come from the ATUM experiment of Sites and Agarwal [14]. The ATUM traces comprise realistic workloads and include both operating system and multiprogramming activity. The five uniprocessor traces, derived from large programs running under VMS, are described in Table 2. We also use a multiprogramming trace called MUL6.0, which includes activity from six processes, including a FORTRAN compile, a directory search, and a microcode address allocator. Each trace length is on the order of a half million references. We believe these lengths are adequate for our purposes, since we explicitly subtract the number of first-time misses and present the percentage of interference misses removed, and because it is possible to differentiate the performance of the various caching methods without resorting to measurement methods that yield cache miss rates with a degree of accuracy exceeding the first or second decimal place. The compulsory miss rates and other parameters for these traces are listed in Table 3. The block size for measuring u and unique is set to 16 bytes (four words).

Table 2: Description of uniprocessor traces used during simulation.

  trace    description
  LISP0    LISP runs of BOYER (a theorem prover)
  DEC0.1   Behavioral simulator of cache hardware, DECSIM
  SPIC0    SPICE simulating a 2-input tristate NAND buffer
  IVEX0    Interconnect verify, a DEC program checking net lists in a VLSI chip
  FORL0    FORTRAN compile of LINPACK

Table 3: Working-set size u, total number of unique blocks, number of references (both instructions and data), and compulsory miss rate for each of the address traces simulated. In the table, u is the average number of unique blocks in 10,000-reference windows, while unique is the total number of unique blocks in the entire trace.

5.3 Measurements and Analysis

In this section, the results of the trace-driven simulations are plotted and interpreted. Before introducing the plots, a few of their features must be explained. If the miss rate of a cache happens to be worse than that of a direct-mapped cache at a particular cache size, as is occasionally the case for a hash-rehash cache, then the percentage of interference misses removed becomes a negative quantity; on the graphs this is instead indicated by a point at zero percent.⁴ The victim cache size has been set to 16 entries; this is based on simulation data which suggest that the removal of conflicts quickly saturates beyond this size.

⁴This is why the points for the hash-rehash cache are not connected in the graphs showing percentage of interference misses removed.

5.3.1 Miss Rates and Interference Misses Removed

LISP0 and DEC0.1: The results for the LISP0 and DEC0.1 traces are very similar, so only the LISP0 results are plotted in Figure 8. It is evident that all of the cache designs exhibit much lower miss rates than the direct-mapped cache. The lowest miss rates are achieved by the two-way set-associative and the column-associative caches; the victim and hash-rehash caches have higher miss rates. The LISP0 trace has a small working set compared to the other traces (see Table 3), and therefore the percentage of interference misses removed quickly approaches 100% for all but the victim cache, which lies around 25% or less on this metric. Notice also that the dashed curve corresponding to the predictions of the model is very close to the curve obtained from simulations.

A striking feature of the miss rate plots is the relationship between the direct-mapped and hash-rehash caches. Whenever doubling the cache size results in a sharp decrease in the direct-mapped miss rate, the same change in cache size yields a sharp and similarly sized increase of the hash-rehash miss rate, a phenomenon readily explained by the approximate analytical expression for this metric, (1 - 2u/S).⁵ The effect makes sense intuitively: a hash-rehash cache is designed to resolve conflicts through the use of alternate cache locations, and it is successful as long as the number of conflicts decreases only slightly as the cache size increases. However, if an increase in cache size itself suddenly removes a large portion of the conflicts, then the hash-rehash algorithm clobbers many locations and suffers a sharp drop in the second-time hit rate, because it is attempting to resolve conflicts which no longer exist. Notice that the column-associative cache does not suffer from this degradation, because its access algorithm is designed specifically to alleviate the problems of clobbering and low second-time hit rates.

⁵The addition of one high-order bit to the index could separate two groups of addresses which conflict often because they differ for the first time in that bit.

SPIC0, IVEX0, and FORL0: The results for these traces are similar enough to be grouped together; the data for the SPIC0 trace have been plotted in Figure 9. Nearly all of the observations made above apply to the simulations with these three traces. However, because the working-set sizes of these traces are larger than that of the LISP0 trace, the percentages of interference misses removed by column associativity start at much lower values and approach 100% more slowly. The victim cache is much more sensitive to working-set size, and it does not attain the same percentages found for LISP0 and DEC0.1. Recall that the victim cache size remains constant, while the column-associative and the set-associative caches can devote larger areas to resolving conflicts as cache size increases. In addition, remember that each victim-cache entry is a complete cache line, storing the tag, status bits, and the data block, which contains four words in these simulations.
Figure 8: Miss rates and percentages of interference misses removed versus cache size for LISP0. Block size is 16 bytes.

Figure 9: Miss rates and percentages of interference misses removed versus cache size for SPIC0. Block size is 16 bytes.
The plots for SPIC0 in Figure 9 reveal another interesting fact: the column-associative cache outperforms the two-way set-associative cache at some of the cache sizes. A hypothesis that explains this behavior is based on the fact that, at equal total cache size, the set-associative cache has only half as many sets, so the column-associative cache indexes with one additional bit. Adding a high-order bit to the index may eliminate a large number of conflicts, because many addresses differ for the first time in that bit. For example, consider the addresses 0001111, 0101111, and 1011111. All three result in multiple conflicts (thrashing) if only the four low-order bits are used as the index; note that both a 16-block column-associative cache and a 32-block two-way set-associative cache have 16 sets. A 32-set column-associative cache of the same total size, however, uses five bits for the index. In this case, the conflicts between 0101111 and 1011111 are automatically eliminated because of their different fifth bits, even though the cache still exhibits thrashing between 0001111 and 0101111.

MUL6.0: The miss rates and percentages of interference misses removed for the multiprogramming trace are plotted in Figure 10. Once again, many of the observations made for the other traces apply to MUL6.0, but there are several important differences. Perhaps the most telling result is the relatively poor performance of the victim cache: its miss rate is virtually the same as that of the direct-mapped cache (for cache sizes greater than 2K blocks). The large working sets of multiprogramming workloads make the fixed size of the victim cache a serious liability, and the larger available area for storing conflicts in the column-associative cache is clearly a big win in this situation.

Figure 10: Miss rates and percentages of interference misses removed versus cache size for MUL6.0. Block size is 16 bytes.

5.3.2 Average Memory Access Times

Two key factors must be considered when interpreting the access time data. First, although the average memory access times of set-associative caches are in reality increased by their higher hit access times, the graphs in this paper assume their hit access times are the same as that of direct-mapped caches. (This is why the corresponding curves are labeled "Ideal".) If realistic access times of two-way set-associative caches were considered, their average memory access times might well become greater than those of column-associative caches. Second, the average memory access time is very sensitive to the time required to service a miss (M); the results assume M = 20 cycles. For larger (and still reasonable) miss penalties, designs such as column-associative caches, which reduce the number of accesses to main memory (R - H1 - H2), will look even more impressive than indicated by our results. As mentioned earlier, our graphs include access time results for both one-cycle and two-cycle swaps.

The results for the LISP0 and SPIC0 traces are presented together in Figure 11. The graph for LISP0 shows that column associativity achieves much lower average memory access times than a direct-mapped cache. The improvement is about 0.2 cycles over direct-mapped caches when the caches are smaller than 4K blocks (with a 16-byte block, this is a cache size of 64K bytes); the savings are smaller for larger caches. For SPIC0, the improvement of the column-associative cache exceeds 0.3 cycles for most cache sizes, and the two-way set-associative cache saves a further 0.1 cycle. The results for MUL6.0 are largely similar, although the column-associative cache saves about 0.2 cycles only for small caches. Perhaps most important, however, is the fact that the column-associative cache achieves an average access time close to that of the two-way set-associative cache, even though the hit access time of the set-associative cache was (unrealistically) kept the same as that of a direct-mapped cache. All of the average memory access time plots are largely similar in shape to the miss rate plots, which is expected, because t_ave is a linear function of the miss rate.
Figure 11: Average memory access times (in cycles) versus cache size for LISP0 and SPIC0. Block size is 16 bytes.

5.4 Summary

This section presents data for each of the metrics averaged over all of the traces. The resulting plots serve as excellent examples for reviewing the major points made in this section. Based on this averaged data and on most of the other results, the column-associative cache appears to be a good choice under most operating conditions.

When the miss rates of all six traces are averaged for each cache size, the plot in Figure 12 is the result. The direct-mapped miss rate is the baseline for comparison; it falls quickly from 6.0% to about 2.0% before settling toward the average compulsory miss rate of about 1.0%. This is due to the fact that once the cache size exceeds the working-set size, the interference miss rate drops markedly. The other cache designs can be split into two groups, based not only on their similar miss rate curves but also on the relationships among their access algorithms: the first group contains the hash-rehash cache and the victim cache, and the second group consists of the two-way set-associative and the column-associative caches. The hash-rehash cache is usually an improvement upon direct-mapped caching, since its miss rate drops more quickly. At the transition point, however, the hash-rehash miss rate increases about as much as the direct-mapped miss rate decreases, because the rehash accesses performed by the hash-rehash algorithm are now more likely to clobber live data than to resolve conflicts. The victim cache does not suffer from this effect, and neither does the column-associative cache, which is designed specifically to alleviate the main problems of the hash-rehash algorithm: clobbering and low second-time hit rates. As predicted in Section 3, the column-associative cache achieves two-way set-associative miss rates: the miss rates of these caches are almost 2.0% lower than direct-mapped miss rates for small caches, just under 1.0% lower near the transition, and right at the compulsory miss rate for large caches.

Figure 12: Miss rates versus cache size, averaged over all six traces. Block size is 16 bytes.

The plot in Figure 13 shows the average percentages of interference misses removed. (This average does not include the MUL6.0 numbers, so that we could compare the simulation averages with the model.) The curves for the set-associative and column-associative caches are almost identical, starting at about 40% and climbing to 100% as the cache size reaches 256K blocks. As predicted in Section 2, the performance of the victim cache relative to the column-associative cache degrades with cache size. Finally, the dashed curve for the model is seen to be surprisingly close to the simulation results when the individual trace anomalies are averaged out.

Figure 13: Percentages of interference misses removed versus cache size, averaged over the single-process traces. Block size is 16 bytes.

The average memory access time (t_ave) data for the six traces have been averaged and plotted in Figure 14. Relative to the direct-mapped cache, t_ave is reduced by over 0.2 cycles for small to moderate caches, and by about 0.1 cycles for moderate to large caches.

Figure 14: Average memory access times versus cache size, averaged over all six traces. The hit access time of two-way set-associative caches is assumed to be the same as that of a direct-mapped cache. Block size is 16 bytes.

6 Conclusions

The goal of this research has been to develop area-efficient cache control algorithms for improved cache performance. The main metrics used to evaluate cache performance have been the miss rate and the average memory access time; unfortunately, minimizing one of them usually affects the other adversely. The optimal cache design would remove interference misses as well as a two-way set-associative cache but would maintain the fast hit access time of a direct-mapped cache. Two previous solutions which attempted to achieve this are the hash-rehash cache and the victim cache. Although some performance gain is achieved by both these schemes, the success of the hash-rehash cache is very erratic and is hampered by clobbering and low second-time hit rates, while the drawbacks of the victim cache include the need for a large, fully-associative buffer and its lack of robust performance (in terms of its miss rate) as the size of the primary cache increases.

This paper proposed the design of a column-associative cache that has the good hit access time of a direct-mapped cache and the high hit rate of a set-associative cache. The fundamental idea behind column associativity is to resolve conflicts by dynamically choosing different locations in which the conflicting data can reside. The key aspect which distinguishes the column-associative cache is the use of a rehash bit to indicate whether a cache set is a rehashed location. Trace-driven simulations confirm that the column-associative cache removes almost as many interference misses as does the two-way set-associative cache. In addition, the average memory access times for this cache are close to those of an ideal two-way set-associative cache, even when the access time of the two-way set-associative cache is assumed to be the same as that of a direct-mapped cache. Finally, the hardware costs of implementing this scheme are minor, and almost negligible if the state represented by the rehash bit can be encoded into the existing status bits of many practical cache designs.
7 Acknowledgments

The research reported in this paper is funded by NSF grant #MIP-9012773 and DARPA contract #N00014-87-K-0825.

References

[1] Alan Jay Smith. Cache Memories. Computing Surveys, 14(3):473-530, September 1982.

[2] Steven Przybylski, Mark Horowitz, and John Hennessy. Performance Tradeoffs in Cache Design. In Proceedings of the 15th Annual Symposium on Computer Architecture, pages 290-298. IEEE Computer Society Press, June 1988.

[3] Mark D. Hill. A Case for Direct-Mapped Caches. IEEE Computer, 21(12):25-40, December 1988.

[4] Norman P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of the 17th Annual Symposium on Computer Architecture, pages 364-373. IEEE Computer Society Press, May 1990.

[5] Anant Agarwal, John Hennessy, and Mark Horowitz. Cache Performance of Operating Systems and Multiprogramming. ACM Transactions on Computer Systems, 6(4):393-431, November 1988.

[6] Matthew K. Farrens and Andrew R. Pleszkun. Improving Performance of Small On-Chip Instruction Caches. In Proceedings of the 16th Annual Symposium on Computer Architecture, pages 234-241. IEEE Computer Society Press, May 1989.

[7] R. E. Kessler, Richard Jooss, Alvin Lebeck, and Mark D. Hill. Inexpensive Implementations of Set-Associativity. In Proceedings of the 16th Annual Symposium on Computer Architecture, pages 131-139. IEEE Computer Society Press, May 1989.

[8] Kimming So and Rudolph N. Rechtschaffen. Cache Operations by MRU Change. Technical Report RC 11613, IBM T. J. Watson Research Center, November 1985.

[9] Anant Agarwal, Mark Horowitz, and John Hennessy. An Analytical Cache Model. ACM Transactions on Computer Systems, 7(2):184-215, May 1989.
[10] Alan Jay Smith. A Comparative Study of Set Associative Memory Mapping Algorithms and Their Use for Cache and Main Memory. IEEE Transactions on Software Engineering, SE-4(2):121-130, March 1978.

[11] Dominique Thiebaut and Harold S. Stone. Footprints in the Cache. ACM Transactions on Computer Systems, 5(4):305-329, November 1987.

[12] Anant Agarwal and Steven Pudar. Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches. Technical Report LCS TM-484, MIT, March 1993.

[13] M. Hill and A. Smith. Evaluating Associativity in CPU Caches. IEEE Transactions on Computers, 38(12):1612-1630, December 1989.

[14] Richard L. Sites and Anant Agarwal. Multiprocessor Cache Analysis using ATUM. In Proceedings of the 15th International Symposium on Computer Architecture, pages 186-195. IEEE, New York, June 1988.

A Cache Implementation Example

The datapaths required for a column-associative cache are displayed in Figure 15. An n-bit multiplexer can be used to switch the current contents of the memory address register (MAR) between the two conflicting addresses; since the rehashing function used is bit flipping, the functional block f(x) is simply an inverter, and f(f(a)) recovers the original index. Every cache set must have a rehash bit appended to it.⁶ In order to accomplish the swap of conflicting data, a data buffer and a swap buffer are required, and a MUX is needed at the input of the data buffer so that it may read data from either the swap buffer or the data bus. All buffers are assumed to be edge-triggered.

⁶In some implementations, the rehash state can be encoded using the existing state bits associated with each cache line, thus eliminating the need for an extra bit.

Figure 15: Column-associative cache implementation.

In constructing the state flow table (Table 4), all cache accesses are assumed to be reads. First-time hits proceed as in direct-mapped caches. If there is a first-time miss and the rehash bit of this location is a one, then the column-associative algorithm requires that this location be replaced by the data from memory: the processor is stalled (STALL) and a memory request (MEM.RQ) is issued; when the memory acknowledges completion (MACK), the data is taken off the data bus (LD) and written back into the cache (WT), which is accomplished in the XWAIT state. On the other hand, if the first location is not rehashed (!HB), then a rehash is to be performed: MSEL and LM are asserted to load the MAR with f[a], LS is asserted to move the first datum into the swap buffer, the second-time access is begun (RD), and the state changes to f1[a]. If there is a second-time hit, then the correct datum resides in the data buffer. State f1[a] also moves the datum accessed the first time from the swap buffer to the data buffer (by asserting DSEL and LD), where it can be written back into the rehashed location in the next state, f2[a]; state f1[a] reloads the MAR with the original address, and f2[a] issues the write (WT). A second-time miss is handled similarly by states WAIT1 and WAIT2, except that the correct datum to be swapped into the non-rehashed location comes from the memory.

Table 4: State flow table for the control logic of a column-associative cache.
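The control flow just described can also be rendered as a small state table in code. The sketch below is our reconstruction from the prose; the exact per-transition signal sets in the paper's Table 4 may differ.

```python
# Sketch of the Table 4 control FSM (our reconstruction; the output sets
# per transition are assumptions inferred from the description above).
# Inputs: HIT, HB (rehash bit of the accessed set), MACK (memory ack).
# Entry: (state, input) -> (asserted outputs, next state).
FSM = {
    ("IDLE",  "HIT"):       ((),                                "IDLE"),
    ("IDLE",  "!HIT.!HB"):  (("MSEL", "LM", "LS", "RD"),        "f1[a]"),
    ("IDLE",  "!HIT.HB"):   (("STALL", "MEM.RQ"),               "XWAIT"),
    ("f1[a]", "HIT"):       (("DSEL", "LD", "MSEL", "LM"),      "f2[a]"),
    ("f1[a]", "!HIT"):      (("STALL", "MEM.RQ"),               "WAIT1"),
    ("f2[a]", "-"):         (("WT",),                           "IDLE"),
    ("WAIT1", "MACK"):      (("LD", "WT", "MSEL", "LM"),        "WAIT2"),
    ("WAIT2", "-"):         (("DSEL", "LD", "WT"),              "IDLE"),
    ("XWAIT", "MACK"):      (("LD", "WT"),                      "IDLE"),
}
```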