
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

Exploiting Early Tag Access for Reducing L1 Data Cache Energy in Embedded Processors
Jianwei Dai, Menglong Guan, and Lei Wang, Senior Member, IEEE

M. Guan and L. Wang are with the Department of Electrical and Computer Engineering, University of Connecticut, Storrs-Mansfield, CT 06269 USA.

Digital Object Identifier 10.1109/TVLSI.2013.2241088

Abstract— In this paper, we propose a new cache design technique, referred to as early tag access (ETA) cache, to improve the energy efficiency of data caches in embedded processors. The proposed technique performs ETAs to determine the destination ways of memory instructions before the actual cache accesses. It, thus, enables

only the destination way to be accessed if a hit occurs during the ETA. The proposed ETA cache can be configured under two operation modes to exploit the tradeoffs between energy efficiency and performance. It is shown that our technique is very effective in reducing the number of ways accessed during cache accesses. This enables significant energy reduction with negligible performance overheads. Simulation results demonstrate that the proposed ETA cache achieves over 52.8% energy reduction on average in the L1 data cache and translation lookaside buffer. Compared with the existing cache design techniques, the ETA cache is more effective in energy reduction while maintaining better performance.

Index Terms— Cache, low power.



I. INTRODUCTION

Reducing power consumption in cache memory is a critical problem for embedded processors that target low-power applications. It was reported that on-chip caches could consume as much as 40% of the total chip power [1], [2]. Furthermore, large power dissipation could cause other issues, such as thermal effects and reliability degradation. This problem is compounded by the fact that data caches are usually performance critical. Therefore, it is of great importance to reduce cache energy consumption while minimizing the impact

on processor performance.

Many cache design techniques [3]–[12] have been proposed at different levels of the design abstraction to exploit the tradeoffs between energy and performance. As caches are typically set-associative, most microarchitectural techniques aim at reducing the number of tag and data arrays activated during an access, so that cache power dissipation can be reduced. Phased caches [13] access tag arrays and data arrays in two different phases. Energy consumption can be reduced greatly because at most only one data array, corresponding to the matched tag if any, is accessed. Due to the increase in access cycles, phased caches are usually applied in the lower level memory, such as L2 caches, whose performance
Manuscript received June 4, 2012; revised October 5, 2012; accepted December 26, 2012. J. Dai is with Intel Corporation, Hillsboro, OR 97124 USA.

is relatively less critical. For L2 caches under the write-through policy, a way-tagging technique [14] sends the L2 tag information to the L1 cache when the data is loaded from the L2 cache. During subsequent accesses, the L2 cache can be operated in an equivalent direct-mapping manner, thereby improving energy efficiency without incurring performance degradation. To reduce the energy consumption of L1 caches, way-predicting techniques [15]–[17], [22] make a prediction on the tag and data arrays in which the desired data might be located. If the prediction is correct, only one way is accessed to complete the operation; otherwise, all ways are accessed to search for the desired data. Mispredictions lead to cache re-accesses, which introduce a performance penalty. Another cache design technique, referred to as way-halting cache [9], stores some lower order bits of each tag in the tag arrays into a fully associative cache and compares them against the corresponding bits in the incoming memory address in parallel with set index decoding. If there is a match, only the corresponding data arrays need to be activated, thereby reducing the energy consumption of cache accesses. This

technique, however, requires architectural supports in some cases. In addition, since the search is done in a fully associative cache, the latency of the search might be longer than what the set index decoding takes if the number of sets is large. As a result, this technique could potentially increase critical paths and introduce a performance penalty. For very highly set-associative (e.g., 32-way) data caches in high-end microprocessors, a technique proposed in [8] employs an additional fully associative memory to record the way information of the recent cache accesses at the Load/Store Queue (LSQ) stage. This memory will be searched prior to a cache access, and the potential destination way can be determined if the address is found. This technique, however, may not be effective for typical L1 data caches (e.g., four-way set-associative) commonly used in embedded processors due to the overhead associated with the additional fully associative memory. In addition to static random-access memory (SRAM)-based cache designs, content addressable memory (CAM) is also an option for low-power embedded systems [10], [11]. However, the access latency of CAM is in general longer than that of SRAM-based caches, in particular when the cache associativity is low (e.g., four-way and eight-way), which is common

in high-performance embedded systems. In this paper, we propose a new cache technique, referred to as early tag access (ETA) cache, to improve the energy efficiency of L1 data caches. In a physical tag and virtual index cache, a part of the physical address is stored in the tag arrays while the conversion between the virtual address and


the physical address is performed by the TLB. By accessing the tag arrays and TLB during the LSQ stage, the destination ways of most memory instructions can be determined before accessing the L1 data cache. As a result, only one way in the L1 data cache needs to be accessed for these instructions, thereby reducing the energy consumption significantly. Note that the physical addresses generated from the TLB at the LSQ stage can also be used for subsequent cache accesses. Therefore, for most memory instructions, the energy overhead of way determination at the LSQ stage can be compensated for by skipping the TLB accesses during the cache access stage. For memory instructions whose destination ways cannot be determined at the LSQ stage, an enhanced mode of the ETA cache is proposed to reduce the number of ways accessed at the cache access stage. Note that in many high-end processors, accessing L2 tags is done in parallel with the accesses to the L1 cache [27]. Our technique is fundamentally different as ETAs are performed at the L1 cache.

While many high-end processors utilize parallel LSQ and L1 data cache accesses for performance improvement at the cost of high cache traffic and TLB energy consumption, the proposed ETA cache is more effective for embedded processors, where the accesses to the LSQ and L1 data cache are typically performed in series [24], [25] to take advantage of load forwarding for reducing cache traffic and energy consumption. Note that the proposed ETA cache may also be applied to some general purpose processors. For example, some processors (e.g., Alpha 21264 [23]) have a dispatch stage during the Load/Store phase, at which the effective addresses of memory operations are available. Thus, upon accessing the L1 data cache, the available destination way can be utilized to reduce the cache energy consumption. Simulation results show that the proposed ETA cache is more effective in energy reduction with better or equal performance as compared with the related work.

The rest of this paper is organized as follows. In Section II, we review the conventional LSQ and L1 data cache architecture. In Section III, we present the proposed ETA cache. Section IV discusses the VLSI implementation of the ETA cache. Simulation results are provided in Section V.

II. LSQ AND CACHE ARCHITECTURE

To reduce conflict misses, set-associative architectures are commonly employed in cache design. Fig. 1 shows a simple two-way set-associative cache and TLB, where the tag and data arrays are the two major components. In the conventional L1 data cache, all tag arrays and data arrays are activated simultaneously for every read/write access to reduce the access latency. Typically, the latency of the L1 data cache is one clock cycle, though this latency could be higher in deeply pipelined processors. On the other hand, accesses to the tag arrays can always be finished in one cycle [27], [28]. Due to the temporal/spatial locality inherent in various programs [31], data will stay for a while once they have been brought into the data cache. This implies that the tag arrays will keep their contents unchanged except when a cache miss occurs.

Fig. 1. Conventional architecture of LSQ and L1 data cache.

Fig. 2. Pipeline of a load/store instruction between LSQ and L1 data cache.

Fig. 2 shows a portion of the pipeline between the address generation stage and the memory stage in a typical embedded processor, where the LSQ and L1 data cache are accessed in series [24], [25] to take advantage of load forwarding for reducing cache traffic and energy consumption. A load/store instruction will be sent to the LSQ before being

issued to the data cache. Meanwhile, the instruction is compared with the existing ones in the LSQ to determine whether it can be issued to the data cache at the next clock cycle. If not, the instruction will stay in the LSQ stage for more than one clock cycle. From this architecture, we observe that: 1) the memory address of a load/store instruction will be available in the LSQ stage for at least one clock cycle before being issued to the data cache, while the access to the tag arrays can be finished in one cycle; and 2) due to the temporal/spatial locality, the tag arrays will not be updated until a cache miss occurs. Since in the conventional architecture the destination way of a memory instruction cannot be determined before the cache access stage, all the ways in the L1 data cache need to be activated during a cache access for performance consideration, at the cost of energy consumption. Based on these observations, we propose a new cache technique in the next section to improve the energy efficiency of L1 data caches.
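The second observation lends itself to a quick sanity check. The toy model below (not from the paper; the cache geometry, LRU policy, and looping trace are all hypothetical choices for illustration) replays a small address trace through a 4-way set-associative tag filter and counts how often any stored tag is rewritten, showing that the tag arrays change only on a miss.

```python
# Toy illustration of observation 2: tag arrays are rewritten only on a
# cache miss, so tags read a cycle early are almost always still valid.
# Geometry and replacement policy are illustrative, not from the paper.
from collections import deque

NUM_SETS, NUM_WAYS, BLOCK = 16, 4, 32  # hypothetical sizes

def count_tag_updates(trace):
    # Each set holds its tags in LRU order (leftmost = least recent).
    sets = [deque(maxlen=NUM_WAYS) for _ in range(NUM_SETS)]
    tag_updates = 0
    for addr in trace:
        index = (addr // BLOCK) % NUM_SETS
        tag = addr // (BLOCK * NUM_SETS)
        if tag in sets[index]:
            sets[index].remove(tag)   # hit: only LRU order changes
        else:
            tag_updates += 1          # miss: one way's tag is rewritten
        sets[index].append(tag)
    return tag_updates

# A loop over a 16-block working set: 16 compulsory misses, then
# every access hits and the tag arrays stay unchanged.
trace = [base for _ in range(10) for base in range(0, 512, 32)]
```

Out of 160 accesses in this trace, only the 16 compulsory misses touch the tag arrays; the remaining 144 accesses leave them unchanged, which is exactly the window the ETA cache exploits.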



TABLE I
TERMINOLOGY FOR TAG ARRAY AND TLB ACCESSES

                      LSQ stage                Cache access stage
Tag array access      ETA                      actual tag access
TLB access            early TLB access         actual TLB access
Tag hit/miss          early tag hit/miss       cache hit/miss
TLB hit/miss          early TLB hit/miss       TLB hit/miss
Destination way       early destination way    actual destination way

III. PROPOSED ETA CACHE

In a conventional set-associative cache, all ways in the tag and data arrays are accessed simultaneously. The requested data, however, only resides in one way under a cache hit. The extra way accesses incur unnecessary energy consumption. In this section, a new cache architecture referred to as the ETA cache will be developed. The ETA cache reduces the number of unnecessary way accesses, thereby reducing cache energy consumption. To accommodate different energy and performance requirements in embedded processors, the ETA cache can be operated under two different modes: the basic mode and the advanced mode.

A. Basic Mode

From Section II, it is possible to perform an access to the tag arrays at the LSQ stage due to the availability of memory addresses. In the basic mode of the ETA cache, each time a memory instruction is sent into the LSQ, an access to a new set of tag arrays and TLB (referred to as LSQ tag arrays and LSQ TLB, see Section IV-A) is performed, as shown in the dotted lines in Fig. 3. This new set of LSQ tag arrays and LSQ TLB is implemented as a copy of the tag arrays and TLB of the L1 data cache, respectively, to avoid data contention with the L1 data cache. If there is a hit during the LSQ lookup operation, the matched way in the LSQ tag arrays will be used as the destination way of this instruction when it subsequently accesses the L1 data cache. If this destination way is correct, only one way in the L1 data cache needs to be activated, which enables energy savings. On the other hand, if a miss occurs during the lookup operation in the LSQ tag arrays or in the LSQ TLB, the L1 data cache will be accessed in a conventional manner, i.e., all ways in the tag arrays and data arrays of the L1 data cache will be activated.

Fig. 3. Operation of a load/store instruction between LSQ and L1 data cache under the proposed ETA cache.

From Fig. 3, we can see that the two sets of tag arrays and TLB are accessed at two different stages: the LSQ stage and the cache access stage. To differentiate these accesses, we use the terminology defined in Table I.

It is possible that the early destination way of a memory instruction determined at the LSQ stage is not the same as the actual one determined at the cache access stage, due to the cache misses that happen in between. This causes a cache coherence problem if the memory instruction simply follows its early destination way, if any, to access the L1 data cache. To illustrate this, consider the simple example shown in Fig. 4. Assume that an instruction Inst1 loads data1 into register R1. At the LSQ stage, data1 is stored in way0 of the L1 data cache. Thus, Inst1 has an early hit at the LSQ stage and its early destination way is determined as way0. When Inst1 accesses the L1 data cache at the cache access stage (which may happen several clock cycles later if some instructions in the LSQ have a higher priority than Inst1), data1 actually resides in way1 of the L1 data cache due to a cache miss that happens between the LSQ stage and the cache access stage of the instruction Inst1. In this case, a cache coherence problem will occur if Inst1 simply uses its early destination way to access the L1 data cache.

Fig. 4. Illustration of the cache coherence problem related to early destination way information.

To avoid this problem, the ETA cache in the basic mode requires the memory instruction to access all the ways in the actual tag arrays during the cache access stage. At the same time, the data arrays are accessed in parallel using the early destination way only (as the actual destination way is not available yet) to reduce energy consumption while maintaining performance. The destination way obtained from the actual tag arrays is then compared with the early destination way to detect any cache coherence problem.
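The basic-mode flow described above can be sketched in a few lines. This is one interpretation of the text, not the authors' hardware: the class and method names are hypothetical, timing and replacement are not modeled, and data-way activations are counted as a crude energy proxy.

```python
# Sketch of the ETA cache's basic mode: an early lookup in the LSQ tag
# copy picks a destination way; the actual tag access at the cache stage
# verifies it and catches the coherence case where the block moved.
NUM_SETS, NUM_WAYS = 16, 4  # illustrative geometry

class ETACache:
    def __init__(self):
        self.l1_tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
        self.lsq_tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]  # copy
        self.data_ways_activated = 0  # crude dynamic-energy proxy

    def lsq_lookup(self, index, tag):
        """ETA at the LSQ stage: early destination way, or None on a miss."""
        ways = self.lsq_tags[index]
        return ways.index(tag) if tag in ways else None

    def cache_access(self, index, tag, early_way):
        """Cache access stage. All actual tag ways are read, but the data
        arrays are accessed through the early destination way only."""
        ways = self.l1_tags[index]
        actual_way = ways.index(tag) if tag in ways else None
        if early_way is None:
            # Early miss: conventional access, all data ways activated.
            self.data_ways_activated += NUM_WAYS
        else:
            self.data_ways_activated += 1  # single data way
            if actual_way != early_way:
                # Coherence case: an intervening miss moved or evicted
                # the block; the single-way read must be redone via the
                # actual destination way (extra cycle, not modeled here).
                pass
        return actual_way
```

An early hit thus activates one data way instead of four; an early miss falls back to the conventional all-way access, so correctness never depends on the early information being up to date.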
