
Hierarchical Multiple Associative Mapping in Cache Memories

Hamid R. Zarandi, Seyed Ghassem Miremadi


Department of Computer Engineering, Sharif University of Technology
E-mails: zarandi@ce.sharif.edu, miremadi@sharif.edu

Abstract

In this paper, a new cache placement scheme is proposed to achieve higher hit ratios with respect to the two conventional schemes, namely set-associative and direct mapping. Similar to set-associative mapping, in this scheme the cache space is divided into sets, but of different sizes. Hence, the length of the tag field associated with each set is also variable and depends on the partition the set is in. The proposed mapping function has been simulated with standard trace files, and statistics have been gathered and analyzed for different cache configurations. The results reveal that, under an LRU replacement policy, the proposed scheme exhibits a higher hit ratio than the two well-known mapping schemes, set-associative and direct mapping, while its area and power consumption are lower than those of the fully-associative scheme.

1. Introduction

In high-performance computer systems, memory bandwidth is often a bottleneck because it plays a critical role in determining peak throughput. The use of a cache is the simplest cost-effective way to achieve higher memory bandwidth. The performance of a cache depends on several factors such as cache size, block size, mapping function (placement method), replacement algorithm, and write policy [8, 18]. Researchers have proposed many schemes and algorithms for the placement and replacement of lines in cache memories to improve cache performance and to decrease cost [5, 6, 7, 8, 13, 14]. Moreover, increasing cache associativity, which decreases the number of conflicts or interferences between references, decreases miss rates. For example, in the set-associative scheme the miss rate of the cache decreases as the size of the sets grows; the column-associative cache [2] and the predictive sequential associative cache [5] were proposed to achieve near-optimal performance for an associativity degree of two. Therefore, increasing the associativity degree beyond two and improving placement algorithms are important ways to further improve cache performance [7].

This paper introduces a new placement scheme for cache memories based on a variable associativity degree. This scheme is a generalized version of the HBAM cache (Hierarchical Binary Associative Mapping), which was previously introduced in [22]. In this scheme, using a division parameter k, the cache space is divided into sets of different sizes, as in the set-associative scheme, but organized in a hierarchical structure where the size of a set at a given level is k times larger than that of a set in the next level of the hierarchy. Thus, the new scheme is called Hierarchical Multiple Associative Mapping (HMAM). Unlike set-associative mapping, with its fixed modulo-based translation from CPU addresses into cache sets, HMAM uses an address translation function based on a variable modulo system. This characteristic increases the hit ratio and decreases both area and power consumption relative to fully-associative caches.

The remainder of the paper is organized as follows. Section 2 presents related work. Section 3 describes the proposed placement scheme, HMAM. Section 4 presents the performance evaluation of HMAM. Section 5 discusses cost and power consumption. Finally, Section 6 concludes the paper.

2. Related work

Sometimes a direct-mapped cache has a high miss rate, resulting in higher memory access time. Increasing cache associativity can decrease the cache miss rate and hence the memory access time. For example, the average miss rate for the SPEC92 benchmarks is 4.6% for a direct-mapped 8-Kbyte cache, 3.8% for a 2-way 8-Kbyte cache, and only 2.9% for a 4-way 8-Kbyte cache [8]. Though these differences may appear small, they in fact translate into big performance differences, due to the large penalty cycles of misses [25]. A higher associativity degree is important when the miss penalty is large and when memory and memory-interconnect contention delays are significant or sensitive to the cache miss rate. Both situations may occur in shared-memory multiprocessors [12].

A uniprocessor may also have a large miss penalty when it has only a first-level cache and the gap between processor and memory speeds is large. Increasing associativity also has the advantage of reducing the probability of thrashing. Repeatedly accessing m different blocks that map into the same set will cause thrashing; a cache with an associativity degree of n can avoid such thrashing if n ≥ m and an LRU replacement policy is employed [26]. A hash-rehash cache [1] uses two mapping functions and a sequential search to determine the candidate location with an associativity of two, but its non-LRU replacement results in a higher miss rate. Agarwal et al. [2] proposed the column-associative cache, which improves the hash-rehash cache by adding hardware to implement an LRU replacement policy. The predictive sequential associative cache proposed by Calder et al. [5] uses bit selection, a sequential search, and a steering-bit table indexed by predictive sources to determine the search order. However, this approach is based on prediction, which may be incorrect, and has a slightly longer average access time. The skewed-associative cache [17] increases associativity in an orthogonal dimension, using skewing functions instead of bit selection to determine candidate locations. The major drawbacks of this scheme are a longer cycle time and the mapping hardware necessary for skewing.

Ranganathan et al. [16] proposed a configurable cache architecture, useful for media processing, which divides the cache into partitions at the granularity of the conventional cache. Its key drawback is that the number and granularity of the partitions are limited by the associativity of the cache; it also requires modifying the cache hardware to support dynamic partitioning and associativity. Another configurable cache architecture, intended for specific applications of embedded systems, has been proposed in [25]; its mapping function can be configured by software to be direct-mapped, 2-way, or 4-way set-associative.

3. HMAM Organization

In the HMAM organization, the cache space is divided into several variable-size associative sets in a hierarchical form. Let C be the cache size in blocks and k be the division factor used for dividing the cache space. In HMAM, the cache space is first divided into k sets, numbered 1, 2, …, k-1, k, each with associativity C/k and located at hierarchy level 1. The last set, i.e., the k-th set, is then divided into k associative sets at hierarchy level 2, numbered k, k+1, …, 2k-1. The last of these, i.e., set 2k-1, is in turn divided into k associative sets. This procedure is repeated until the divided sets consist of only one block each. Hence, in an HMAM cache, the cache size C should be a power of k.

In this scheme, set sizes vary in powers of k, and the number of sets is (k-1)·log_k(C) + 1. A set at hierarchy level h has an associativity of C/k^h blocks. The first k-1 sets have the largest size of C/k blocks, while the last k sets contain a minimum of one block each.

In this scheme, k is a parameter that adjusts the associativity used in the cache. In other words, if k is 1 the HMAM cache is a fully-associative cache, and if k is C the HMAM cache is a direct-mapped cache. Also, the HBAM cache [22] is an HMAM cache with k equal to 2. For typical values of k (i.e., 2, 4, 8, 16), the cost of this scheme is less than the cost of the fully-associative scheme. Due to the separate logic associated with each set, its operation is relatively fast compared to fully-associative mapping, though slower than set-associative mapping [23, 24]. In this scheme, address translation is performed dynamically (value-based) instead of statically (bit selection) as in the direct-mapping scheme. This means that there is no predefined format that determines the set number of a memory address in the cache; the set number must be determined from the address pattern coming from the CPU. As an example, Figure 1 shows the set organization of a 16-block HMAM cache with k equal to 4 (4HMAM), for a main memory of 128 blocks: sets 1, 2, and 3 at hierarchy level 1 hold four blocks each and receive block addresses of the form 4m+3, 4m+2, and 4m+1, respectively, while sets 4 through 7 at hierarchy level 2 hold one block each. Table 1 portrays the address mapping and the required tag-storage bits in the 4HMAM cache, where the block size is 2^b words. The number of sets in the 4HMAM cache is 3·log_4(C) + 1.

Figure 1. The 4HMAM organization of a 16-block cache in a system with a 128-block main memory
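To make the variable-modulo translation concrete, the following minimal Python sketch (an illustrative reconstruction from the description above and Figure 1, not the authors' implementation) maps a main-memory block address to its HMAM set number; Table 1 below gives the corresponding hardware address decoding.

```python
import math

def hmam_set(block_addr: int, k: int, cache_blocks: int) -> int:
    """Map a main-memory block address to a 1-based HMAM set number.

    At each hierarchy level the block address is taken modulo k: a
    non-zero remainder r selects one of the k-1 sets of that level,
    while a zero remainder descends to the next level. Assumes k >= 2
    and that cache_blocks (C) is a power of k.
    """
    levels = round(math.log(cache_blocks, k))
    set_no = 0
    for level in range(1, levels + 1):
        r = block_addr % k
        if r != 0 or level == levels:   # deepest level keeps all k remainders
            return set_no + (k - r)
        block_addr //= k                # remainder 0: descend one level
        set_no += k - 1                 # skip the k-1 sets of this level

# The 4HMAM example of Figure 1 (16-block cache, k = 4): addresses of the
# form 4m+3, 4m+2, 4m+1 fall into sets 1-3; 16m+12, 16m+8, 16m+4 and 16m
# fall into the single-block sets 4-7.
assert [hmam_set(a, 4, 16) for a in (3, 2, 1, 12, 8, 4, 0)] == [1, 2, 3, 4, 5, 6, 7]
```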

Table 1. Address mapping and required bits for tag in each set (4HMAM; A_i denotes address bit i, n the address width in bits, b the block-offset width)

Set #          | Logical condition in address decoder | Tag storage bits                   | Bit length of tag array | Associativity in set
1              | A_{b+1} A_b = 11                     | A_{n-1} downto A_{b+2}             | n-b-2                   | C/4
2              | A_{b+1} A_b = 10                     | A_{n-1} downto A_{b+2}             | n-b-2                   | C/4
3              | A_{b+1} A_b = 01                     | A_{n-1} downto A_{b+2}             | n-b-2                   | C/4
4              | A_{b+3} A_{b+2} A_{b+1} A_b = 1100   | A_{n-1} downto A_{b+4}             | n-b-4                   | C/4^2
…              | …                                    | …                                  | …                       | …
3·log_4(C)-3   | A_{b+2·log_4(C)-3} … A_b = 01 00…0   | A_{n-1} downto A_{b+2(log_4(C)-1)} | n-b-2(log_4(C)-1)       | 4
3·log_4(C)-2   | A_{b+2·log_4(C)-1} … A_b = 11 00…0   | A_{n-1} downto A_{b+2·log_4(C)}    | n-b-2·log_4(C)          | 1
3·log_4(C)-1   | A_{b+2·log_4(C)-1} … A_b = 10 00…0   | A_{n-1} downto A_{b+2·log_4(C)}    | n-b-2·log_4(C)          | 1
3·log_4(C)     | A_{b+2·log_4(C)-1} … A_b = 01 00…0   | A_{n-1} downto A_{b+2·log_4(C)}    | n-b-2·log_4(C)          | 1
3·log_4(C)+1   | A_{b+2·log_4(C)-1} … A_b = 00 00…0   | A_{n-1} downto A_{b+2·log_4(C)}    | n-b-2·log_4(C)          | 1
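The tag widths in Table 1 follow a simple pattern: for k = 4, a set at hierarchy level h fixes 2h low block-address bits, leaving n - b - 2h bits of tag. A small sketch under the same reconstruction assumptions as above (k a power of two, C a power of k):

```python
import math

def hmam_tag_bits(set_no: int, k: int, cache_blocks: int, n: int, b: int) -> int:
    """Tag-array bit length for a set, per the pattern of Table 1.

    Sets (k-1)(h-1)+1 .. (k-1)h live at hierarchy level h; the last k
    single-block sets all sit at the deepest level log_k(C).
    """
    levels = round(math.log(cache_blocks, k))
    level = min((set_no - 1) // (k - 1) + 1, levels)
    return n - b - level * int(math.log2(k))

# 4HMAM with a 16-block cache and 7-bit block addresses (128-block main
# memory, word-sized blocks, so b = 0): sets 1-3 store n-b-2 = 5 tag bits,
# the single-block sets 4-7 store n-b-4 = 3 tag bits.
assert [hmam_tag_bits(s, 4, 16, 7, 0) for s in range(1, 8)] == [5, 5, 5, 3, 3, 3, 3]
```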

4. Performance Analysis

The cache simulator in [4] was modified to simulate the proposed HMAM cache system. The modifications needed to upgrade the simulator were: modification of the set-determination function used for the set-associative cache, and development of a new function that finds the number of associative lines and tag bits for the obtained set. Benchmarks used in this trace-driven simulation included several different kinds of programs from the SPEC2000 benchmarks [19], namely bzip2, apsi, swim, vortex, eon_cook, eon_rush, gcc, gzip, parser, sixtrack, and vpr. Each trace file contains at least 10M references. Both data and instruction references were collected and used for the simulation. Three well-known placement schemes, i.e., direct, set-associative, and fully-associative mapping, were chosen for performance comparison and evaluation.

Two major performance metrics, i.e., the miss (hit) ratio and the average memory access time, are used to evaluate and compare the HMAM cache with the other schemes. The cache miss ratios for the conventional fully-associative (FA), 4-way set-associative (4WSA), direct-mapped (DC), and the proposed HMAM caches with several values of k are shown in Figure 2. For the fully-associative cache, denoted FA in the figure, the notation "32k-8byte" denotes a 32-KB fully-associative cache with a block size of 8 bytes. Notice that the average miss ratio of the HMAM cache for a given size (e.g., 32 KB) is very close to that of the FA cache, and that the HMAM cache approaches the 4WSA as k grows.

Another useful measure for evaluating the performance of any given memory hierarchy is the average memory access time, which is given by

Average memory access time = Hit time + Miss rate × Miss penalty    (1)

Here, hit time is the time to process a hit in the cache, and miss penalty is the additional time required to service a miss. The basic parameters for the simulation are: CPU clock = 200 MHz, memory latency = 15 CPU cycles, memory bandwidth = 1.6 Gb/s. The hit time is assumed to be 1 CPU cycle. These parameters are based on values for common 32-bit embedded processors (e.g., the ARM920T [3] or Hitachi SH4 [9]).

The average memory access times for the conventional fully-associative, direct-mapped, and the various HMAM caches are shown in Figure 3. As the benchmark analysis shows, applications with a high degree of locality, like gzip, exhibit a particularly large performance improvement with the HMAM cache. As shown in these figures, when k is relatively low, HMAM behaves much more like the fully-associative cache than like the conventional set-associative cache, in both miss ratio and average memory access time.

In the case of simulating the k-way set-associative cache, several values of k were considered. For brevity, only a selected figure is shown here. Figure 4 shows the plot of hit ratio against block size for a cache size of 32 KB and a typical selected benchmark, bzip2. It illustrates the comparative performance of the conventional placement schemes (fully-associative, set-associative, and direct mapping) and the HMAM scheme. Notice the general trend of the HMAM scheme exhibiting higher hit ratios (except against the fully-associative scheme). It can be seen in Figure 4 that the HMAM scheme outperforms the set-associative and direct-mapping schemes for a wide variety of cache configurations.
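The modifications described at the start of this section amount to making the set number, associativity, and tag width per-set quantities. A minimal trace-driven sketch of such a model, reusing the hmam_set helper from Section 3 (an illustrative reconstruction, not the modified BYU simulator itself):

```python
import math
from collections import OrderedDict, defaultdict

class HMAMCache:
    """HMAM cache model with variable-size sets and per-set LRU replacement."""

    def __init__(self, cache_blocks: int, k: int, block_bytes: int):
        self.blocks, self.k, self.block_bytes = cache_blocks, k, block_bytes
        self.levels = round(math.log(cache_blocks, k))
        self.sets = defaultdict(OrderedDict)   # set number -> LRU-ordered blocks
        self.hits = self.accesses = 0

    def set_size(self, set_no: int) -> int:
        # A set at hierarchy level h holds C / k**h blocks (Section 3).
        level = min((set_no - 1) // (self.k - 1) + 1, self.levels)
        return self.blocks // self.k ** level

    def access(self, addr: int) -> bool:
        block = addr // self.block_bytes
        s = hmam_set(block, self.k, self.blocks)   # Section 3 sketch
        self.accesses += 1
        ways = self.sets[s]
        if block in ways:
            ways.move_to_end(block)                # refresh LRU position
            self.hits += 1
            return True
        if len(ways) >= self.set_size(s):
            ways.popitem(last=False)               # evict least recently used
        ways[block] = True
        return False

    def amat(self, hit_time: float = 1.0, miss_penalty: float = 15.0) -> float:
        # Equation (1): hit time + miss rate * miss penalty (in CPU cycles).
        return hit_time + (1 - self.hits / self.accesses) * miss_penalty
```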

Figure 2. Miss ratio (%) of the fully-associative, several HMAM, and direct-mapped caches for various benchmarks

Figure 3. Average memory access time (in cycles) of the fully-associative, several HMAM, and direct-mapped caches for various benchmarks

Table 2. Performance and cost of direct-mapped (DC), 4-way set-associative (4WSA), 2HMAM, 4HMAM, and 8HMAM caches

Cache configuration | Area (rbe) | Avg. miss ratio (%) | Avg. memory access time (cycles)
1K-8B (DC)          |  8168.10   | 46.87               | 5.6944
2K-8B (DC)          | 15491.10   | 40.34               | 6.4112
4K-8B (DC)          | 29840.35   | 33.82               | 7.4544
8K-8B (DC)          | 57934.45   | 29.34               | 8.4992
2K-8B (4WSA)        | 16125.90   | 35.64               | 6.7024
4K-8B (4WSA)        | 31089.50   | 30.21               | 5.8336
1K-8B (2HMAM)       |  8957.36   | 38.15               | 7.1040
8K-8B (2HMAM)       | 69635.27   | 22.02               | 4.5228
2K-8B (4HMAM)       | 17385.71   | 32.27               | 6.1632
4K-8B (8HMAM)       | 34218.59   | 28.39               | 5.5424

Figure 4. Miss ratio vs. block size for a 32-KByte cache and the bzip2 benchmark
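As a quick arithmetic check, the "about 40%" area reduction claimed in Section 5.1 below follows directly from the table:

```python
# 4K-8B (8HMAM) versus 8K-8B (DC), areas in rbe from Table 2.
reduction = 1 - 34218.59 / 57934.45
print(f"{reduction:.1%}")   # ~40.9% smaller, at a nearly equal miss
                            # ratio (28.39% vs. 29.34%)
```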

5. Cost and Power Consumption Analysis

5.1 Hardware Complexity

In order to reduce the latency of tag comparison in fully-associative caches, these memories are constructed using CAM (content-addressable memory) structures. Since each CAM cell is designed as a combination of storage and comparison logic, the size of a CAM cell is about double that of a RAM cell [15]. For a fair performance/cost analysis, the performance and cost of various direct-mapped, set-associative, and HMAM caches are evaluated. The metric used to normalize the cost-area analysis is the rbe (register-bit equivalent). We use the same quantities as [8, 10], where the complexity of a PLA (programmable logic array) circuit is assumed to be 130 rbe, a RAM cell 0.6 rbe, and a CAM cell 1.2 rbe.

The RAM area can be calculated as [15]

RAM = 0.6 · (#entries + L_sense) · ((#data bits + #status bits) + W_driver)    (2)

where #entries is the number of rows in the tag or data array, L_sense is the length of a bit-line sense amplifier, #data bits indicates the number of tag bits (or data bits) of one set, #status bits is the number of status bits of one set, and W_driver is the data width of a driver.

The area of a CAM can be given by [15]

CAM = 0.6 · (2 · #entries + L_sense) · (2 · #tag bits + W_driver)    (3)

where #tag bits is the number of bits for one set in the tag array. The total area is then given by

Area = RAM + CAM + PLA    (4)

The area of the HMAM cache was calculated by assuming that it is composed of several fully-associative caches, each of which has its own specified size and tags. Table 2 shows performance/cost for various cache sizes. According to Table 2, the HMAM cache (8HMAM, 4 KB with 8-byte block size) shows about 40% area reduction compared to the conventional direct-mapped cache (DC, 8 KB with 8-byte block size) while showing almost equal performance. Moreover, higher performance for the HMAM scheme may be achieved by increasing the cache size, compared to direct-mapping schemes.

5.2 Power Consumption

Energy dissipation in CMOS integrated circuits is mainly caused by charging and discharging gate capacitances. The energy dissipated per transition is given by [8]

E_t = 0.5 · C_eq · V²    (5)

To obtain the values of the equivalent capacitance, C_eq, of components in the memory subsystem, we follow the model proposed by Wilton and Jouppi [20, 21]. Their model assumes a 0.8 µm process and a supply voltage of 3.3 volts. To obtain the number of transitions that occur at each transistor, the model introduced by Kamble and Ghose [10, 11] is adopted here. According to this model, the main sources of power are the following four components: E_bits, E_word, E_output, and E_input, which denote the energy dissipation of the bit lines, word lines, address and data output lines, and address input lines, respectively. The energy consumption is then given by

E_cache = E_bits + E_word + E_output + E_input    (6)

5.2.1. Energy dissipated in the bit lines. E_bits is the energy consumption of all the bit lines when the SRAMs are accessed; it is due to pre-charging the lines and reading or writing data. It is assumed that the tag and data arrays in the direct-mapped cache can be accessed in parallel. In order to minimize the power overhead introduced in fully-associative caches, a tag look-up is performed first and the data array is then accessed only if a hit occurs. In a K-way set-associative cache, E_bits can be calculated as

E_bits = 0.5 · V² · [N_bp·C_bp + K·(N_hit + N_miss)·(8B + T + S)·(C_g,Qpa + C_g,Qpb + C_g,Qp) + N_bw·C_ba + N_br·C_ba]    (7)

where N_bp, N_bw, and N_br are the total number of transitions in the bit lines due to precharging, the number of writes, and the number of reads, respectively. B is the size of a block in bytes, T is the tag size of one set, and S denotes the number of status bits per block. C_g,Qpa, C_g,Qpb, and C_g,Qp are the gate capacitances of the transistors Q_pa, Q_pb, and Q_p. Finally, C_bp and C_ba are the effective load capacitances of each bit line during pre-charging and during reading/writing from/to the cell. According to the results reported in [20], we have

C_bp = N_rows · (0.5·C_drain,Q1 + C_bitwire)    (8)

C_ba = N_rows · (0.5·C_drain,Q1 + C_bitwire) + C_drain,Qp + C_drain,Qpa    (9)

where C_drain,Q1 is the drain capacitance of transistor Q1, and C_bitwire is the bit-wire capacitance of a single bit cell.

5.2.2. Energy dissipated in the word lines. E_word is the energy consumption due to the assertion of a particular word line; once the bit lines are all precharged, one row is selected, performing the read/write of the desired data. E_word can be calculated as [20]

E_word = V² · K · (N_hit + N_miss) · (8B + T + S) · (2·C_gate,Q1 + C_wordwire)    (10)

where C_wordwire is the word-wire capacitance of a single bit cell; the total word-line load is thus

C_word = N_columns · (2·C_gate,Q1 + C_wordwire)    (11)

5.2.3. Energy dissipated at the data and address output lines. E_output is the energy used to drive the external buses; this component includes the power consumption for both the data sent or returned and the address sent to the lower-level memory on a miss request. E_output can be calculated as

E_output = E_addr_output + E_data_output    (12)

where E_addr_output and E_data_output are the energy dissipations at the address and data lines, and are given by

E_addr_output = 0.5 · V² · N_addr_output · C_addr_out    (13)

E_data_output = 0.5 · V² · N_data_output · C_data_out    (14)

where N_addr_output and N_data_output are the total number of transitions at the address and data output lines, respectively, and C_addr_out and C_data_out are their corresponding capacitive loads. The capacitive load is 0.5 pF for on-chip destinations and 20 pF for off-chip destinations [14].
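Collecting equations (6)-(14), the per-access energy model can be sketched as plain Python functions (a hedged transcription of the analytical model above; all voltages, capacitances, and transition counts are supplied by the caller):

```python
def e_bits(V, n_bp, n_bw, n_br, n_hit, n_miss, K, B, T, S,
           c_bp, c_ba, c_g_qpa, c_g_qpb, c_g_qp):
    """Bit-line energy, equation (7)."""
    return 0.5 * V**2 * (n_bp * c_bp
                         + K * (n_hit + n_miss) * (8*B + T + S)
                           * (c_g_qpa + c_g_qpb + c_g_qp)
                         + n_bw * c_ba + n_br * c_ba)

def e_word(V, n_hit, n_miss, K, B, T, S, c_gate_q1, c_wordwire):
    """Word-line energy, equation (10)."""
    return V**2 * K * (n_hit + n_miss) * (8*B + T + S) * (2*c_gate_q1 + c_wordwire)

def e_output(V, n_addr_out, n_data_out, c_addr_out, c_data_out):
    """Address/data output-line energy, equations (12)-(14)."""
    return 0.5 * V**2 * (n_addr_out * c_addr_out + n_data_out * c_data_out)

def e_cache(bits, word, output, e_input=0.0):
    """Total energy, equation (6); E_input is negligible (Section 5.2.4)."""
    return bits + word + output + e_input
```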

Table 3. Capacitance values

C_drain,Q1          2.737 fF
C_bitwire           4.4 fF/bitcell
C_drain,Qp          80.89 fF
C_drain,Qpa         80.89 fF
C_gate,Q1           0.401 fF
C_wordwire          1.8 fF/bitcell
C_g,Qp              38.08 fF
C_g,Qpa, C_g,Qpb    38.08 fF
C_addr_output       0.5 pF (on-chip)
C_data_output       20 pF (off-chip)

Figure 5. Normalized power consumption of the fully-associative (FA), 2HMAM, 4HMAM, 4-way set-associative (4WSA), and direct-mapped (DC) caches for the bzip2, sixtrack, gcc, and vortex benchmarks and their average
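Plugging Table 3's values into equations (8) and (9) gives the bit-line loads; for example, for a hypothetical array of 128 rows (the row count is an assumption for illustration, not a figure from the paper):

```python
# All capacitances in fF, taken from Table 3.
c_drain_q1, c_bitwire = 2.737, 4.4
c_drain_qp = c_drain_qpa = 80.89
n_rows = 128                                      # assumed array height

c_bp = n_rows * (0.5 * c_drain_q1 + c_bitwire)    # eq. (8): ~738.4 fF
c_ba = c_bp + c_drain_qp + c_drain_qpa            # eq. (9): ~900.1 fF
```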

5.2.4. Energy dissipated at the address input lines. E_input is the energy dissipated at the input gates of the row decoder. The energy dissipated inside the address decoders is not considered, since it turns out to be negligible compared to the other components [8].

The actual values of the different factors of power dissipation are obtained using the above equations and assumptions. Table 3 shows the various capacitance values. To obtain the actual power consumption, the tag array of each section of the proposed cache must be treated as a CAM structure. The values were obtained by considering the proposed cache as a collection of several set-associative caches with different tag widths; the power consumption of the proposed cache can then be compared with that of the fully-associative cache. Figure 5 presents the power consumption of the fully-associative, HMAM, 4-way set-associative, and direct-mapped caches of the same cache size. As shown before, the fully-associative cache achieves slightly better performance than the HMAM cache, but power consumption makes a significant difference: the bit lines of large blocks and the large number of content swaps drive the power consumption of the fully-associative cache above that of the HMAM cache. The power consumption of the 4HMAM cache is thus about 5-22% lower than that of the fully-associative cache configuration. It should be noted that the power consumption of the HMAM cache depends strongly on its division parameter k: as this parameter grows, the power consumption approaches that of the 2WSA and DC caches.

6. Conclusions

We have presented a generalized version of the HBAM (Hierarchical Binary Associative Mapping) scheme, called Hierarchical Multiple Associative Mapping (HMAM). It is similar to the set-associative mapping scheme, and the simplest version of this approach, with k equal to two, is the HBAM cache. Results obtained using a trace-driven simulator for different scenarios reveal that HMAM can provide significant performance improvements with respect to traditional schemes. The cost and power consumption of HMAM are less than those of the fully-associative scheme, and the division parameter k can adjust the power and cost of the HMAM cache to be close to those of the conventional set-associative and direct-mapped caches.

7. References

[1] Agarwal A., Hennessy J., Horowitz M., "Cache Performance of Operating Systems and Multiprogramming," ACM Trans. Computer Systems, Vol. 6, No. 4, 1988, pp. 393-431.
[2] Agarwal A., Pudar S. D., "Column-Associative Caches: a Technique for Reducing the Miss Rate of Direct-Mapped Caches," Int'l Symp. on Computer Architecture, 1993, pp. 179-190.
[3] ARM Company, "ARM920T Technical Reference Manual," http://www.arm.com
[4] Brigham Young University, "BYU Cache Simulator," http://tds.cs.byu.edu
[5] Calder B., Grunwald D., "Predictive Sequential Associative Cache," Proc. 2nd Int'l Symp. on High Performance Computer Architecture, 1996, pp. 244-253.
[6] Chen H., Chiang J., "Design of an Adjustable-way Set-Associative Cache," Proc. Pacific Rim Communications, Computers and Signal Processing, 2001, pp. 315-318.

[7] Khalid H., "KORA-2 Cache Replacement Scheme," Proc. 6th IEEE Int'l Conf. on Electronics, Circuits and Systems (ICECS'99), Vol. 1, 1999, pp. 17-21.
[8] Hennessy J. L., Patterson D. A., "Computer Architecture: A Quantitative Approach," 2nd Edition, Morgan Kaufmann Publishers, 1996.
[9] Hitachi Company, "SH4 Embedded Processor," http://www.hitachi.com
[10] Kamble M. B., Ghose K., “Analytical Energy Dissipation
Models for Low Power Caches,” Proc. of Intl. Symp. on
Low Power Electronics and Design, 1997, pp. 143-148.
[11] Kamble M. B., Ghose K., "Energy-Efficiency of VLSI Caches: A Comparative Study," Proc. IEEE 10th Int'l Conf. on VLSI Design, 1997, pp. 261-267.
[12] Kessler R. R., et al., “Inexpensive Implementations of
Associativity,” Proc. Intl. Symp. Computer Architecture,
1989, pp. 131-139.
[13] Kim S., Somani A., “Area Efficient Architectures for
Information Integrity Checking in the Cache Memories,”
Proc. Intl. Symp. Computer Architecture, 1999, pp. 246-
256.
[14] Lee J. H., Lee J. S., Kim S. D., “A New Cache
Architecture based on Temporal and Spatial Locality,”
Journal of Systems Architecture, Vol. 46, 2000, pp. 1452-
1467.
[15] Mulder J. M., Quach N. T., Flynn M. J., "An Area Model for On-Chip Memories and its Applications," IEEE Journal of Solid-State Circuits, Vol. 26, 1991, pp. 98-106.
[16] Ranganathan P., Adve S., Jouppi N. P. “Reconfigurable
Caches and their Application to Media Processing,” Proc.
Int. Symp. Computer Architecture, 2000, pp. 214-224.
[17] Seznec A., "A Case for Two-Way Skewed-Associative Caches," Proc. Intl. Symp. Computer Architecture, 1993, pp. 169-178.
[18] Smith A. J., "Cache Memories," ACM Computing Surveys, Vol. 14, No. 4, 1982, pp. 473-530.
[19] Standard Performance Evaluation Corporation, SPEC
CPU 2000 benchmarks.
http://www.specbench.org/osg/cpu2000
[20] Wilton S. J. E., Jouppi N. P., "An Enhanced Access and Cycle Time Model for On-chip Caches," Digital WRL Research Report 93/5, 1994.
[21] Wilton S. J. E., Jouppi N. P., "CACTI: An Enhanced Cache Access and Cycle Time Model," IEEE Journal of Solid-State Circuits, Vol. 31, 1996, pp. 677-688.
[22] Zarandi H., Sarbazi-Azad H., "Hierarchical Binary Set Partitioning in Cache Memories," to appear in The Journal of Supercomputing, Kluwer Academic Publishers, 2004.
[23] Zarandi H., Miremadi S. G., Sarbazi-Azad H., “Fault
Detection Enhancement in Cache Memories Using a High
Performance Placement Algorithm,” IEEE International
On-Line Testing Symposium (IOLTS), 2004, pp. 101-106.
[24] Zarandi H., Miremadi S. G., “A Highly Fault Detectable
Cache Architecture for Dependable Computing,” to
appear in 23rd International Conference on Safety,
Reliability and Security (SAFECOMP), 2004, Germany.
[25] Zhang C., Vahid F., Najjar W., "A Highly Configurable Cache Architecture for Embedded Systems," Int'l Symp. on Computer Architecture, 2003, pp. 136-146.
[26] Zhang C., Zhang X., Yan Y., "Two Fast and High-Associativity Cache Schemes," IEEE Micro, 1997, pp. 40-49.

