
IET Computers & Digital Techniques

Research Article

Impact of spintronic memory on multicore cache hierarchy design

ISSN 1751-8601
Received on 4th November 2015
Revised 19th February 2016
Accepted on 4th May 2016
E-First on 25th January 2017
doi: 10.1049/iet-cdt.2015.0190
www.ietdl.org

Cong Ma1, William Tuohy2, David J. Lilja1


1Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis 55455, USA
2Department of Computer Science, University of Minnesota, Minneapolis 55455, USA
E-mail: maxxx376@umn.com

Abstract: Spintronic memory [spin-transfer torque-magnetic random access memory (STT-MRAM)] is an attractive alternative
technology to CMOS since it offers higher density and virtually no leakage current. Spintronic memory still requires
higher write energy, however, presenting a challenge to memory hierarchy design when energy consumption is a concern. This
study motivates the use of STT-MRAM for the first-level caches of a multicore processor to reduce energy consumption without
significantly degrading performance. The large STT-MRAM first-level cache implementation saves leakage power, and
the use of a small level-0 cache recovers the performance lost to the long STT-MRAM write latencies. The combination of the two
reduces the energy-delay product by 65% on average compared with the CMOS baseline. The proposed STT hierarchy also scales
as well as the CMOS hierarchy, with a few benchmarks scaling significantly better. The PARSEC and Splash2 benchmark
suites are analysed running on a modern multicore platform, comparing the performance, energy consumption and scalability of the
spintronic cache system with a CMOS design.

1 Introduction

As CMOS technology starts to face serious scaling and power consumption issues, current static random access memory (SRAM) designs are becoming unable to meet the demand for large, fast and low-power on-chip caches in multicore implementations. A new technology, spin-transfer torque-magnetic RAM (STT-MRAM), a member of the novel non-volatile memory family, has drawn substantial attention in recent years. STT-MRAM offers higher density than a traditional SRAM cache, and its non-volatility facilitates low leakage power [1]. Also, STT-MRAM is one of the few candidates with a read latency similar to current SRAM technology. With this higher cell density and low leakage power, STT-MRAM is generally considered a viable potential alternative to SRAM in future on-chip caches.

However, due to its non-volatile nature, this technology suffers from high dynamic energy consumption, primarily due to high write power and longer write latency [2]. The write latency of STT-MRAM is commonly approximated as 3-4 times that of SRAM [3], but some consider it to be larger [4], so we perform our analysis over a range of latencies. These characteristics seem to fit well at the larger last-level caches of a processor, where high capacity is desirable and longer latency is tolerated. Previous research [5, 6] also showed a performance drop after directly implementing a first-level STT-MRAM cache. Indeed, the majority of research in the area of on-chip STT-MRAM has focused on last-level caches [2, 3, 7].

To be a true replacement for CMOS, however, it would be desirable to use STT-MRAM at all levels of the on-chip cache. CMOS caches have evolved toward deep hierarchies with multiple levels of private caches in multicore designs, since their read and write latencies and power are similar. In a modern chip-multiprocessor (CMP), multiple copies of data exist in different caches, and more data movement occurs between caches for sharing. These extra cache updates, beyond those seen in a single-core processor, increase the energy consumption. Fig. 1 shows a typical multicore hierarchy, highlighting the fact that multiple copies of a cache line typically exist across the hierarchy. Data sharing requires extra data movement across the hierarchy.

The significant leakage reduction potential and extremely long write latency of STT-MRAM motivate this paper to find an optimal cache hierarchy design that reduces cache energy consumption without significantly degrading performance. To best exploit the increased density and reduced leakage power of STT-MRAM, it is necessary to overcome its high dynamic write energy and latency at the lower-level caches of a large CMP. We utilise a novel physics-based model of magnetic tunnel junction (MTJ) switching to develop size and energy models of STT-MRAM cells [8]. Since the usage of first-level and last-level caches is quite different, we have evaluated different circuit-level tradeoffs between MTJ read and write energies to find optimal design points for energy and performance. A drop-in replacement of CMOS with STT-MRAM exposes a fundamental mismatch between the bandwidth of data being written by the processor and the ability of the STT-MRAM cache to absorb it. By introducing a small, fully associative level-0 (L0) cache, this bandwidth mismatch can be accommodated. This structure also benefits cache dynamic energy consumption, since it is so small that both its static and dynamic energy use are quite low. This is an extension of the analysis in [9], which analysed the effectiveness of a small L0 of various sizes compared with a simpler two-level CMOS hierarchy. The contributions of this paper include:

• A detailed analysis of the impact of high write latency at the L1 cache level, including the tradeoff between read and write energies and latencies of STT-MRAM caches.
• A demonstration of the benefit of the write-merging L0 cache to the performance and energy consumption of a fixed core-count system.
• An analysis of scalability with increasing numbers of cores, comparing CMOS caches to STT-MRAM caches.
Fig. 1 Typical multicore cache hierarchy, with multiple copies of data

Fig. 2 An STT-MRAM 1T1MTJ bit-cell
(a) The access transistor and MTJ storage element; resistance through the device is lower in the P state (b) than in the AP state (c)

2 Simulation methodology

An SRAM-style storage cell using STT-MRAM is depicted in Fig. 2a. The bit-cell consists of an access transistor and a storage element that uses an MTJ, known as a one-transistor, one-magnetic-tunnel-junction (1T1MTJ) cell. The MTJ consists of a fixed layer and a free layer, separated by a thin insulator that allows a tunnelling current to flow when biased. The material used for the fixed and free layers has two stable spin directions, with the spin direction of the fixed layer locked. The free layer can have either the same spin orientation as the fixed layer, known as parallel (P) (Fig. 2b), or the opposite orientation, known as anti-parallel (AP) (Fig. 2c). Resistance through the device is higher in the AP state, so a read operation consists of sensing the high or low resistance value. For a write operation, current passing through the device in one direction gives the free layer the P orientation, while current passing in the other direction creates the AP orientation. There is a critical minimum write current which must be maintained for an adequate period of time to allow complete switching, leading to the longer latency of write operations. The access transistor must be sized to provide a sufficient switching current, and a higher current (above the critical value) gives a shorter switching delay. The access transistor is usually larger than the MTJ, so there is a tradeoff between bit-cell area and switching time. A larger bit-cell can have a lower switching time and lower energy, but creates a larger array for the same storage capacity, leading to longer wires and higher energy requirements at the array level.
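As a first-order illustration of this area/latency tradeoff (an approximation for intuition only, not the physics-based model of [8] used in this work), the energy delivered to a single bit-cell during a write scales with the write current, the bias voltage and the pulse width, while the pulse width needed for reliable switching shrinks as the drive current is raised above the critical current:

\[
E_{\mathrm{write}} \;\approx\; I_{w}\, V_{\mathrm{bias}}\, t_{w},
\qquad
t_{w} \;\propto\; \frac{1}{I_{w} - I_{c}} \quad (I_{w} > I_{c}),
\]

where I_c is the critical switching current. A larger access transistor supplies a larger I_w and hence a shorter t_w, but at the cost of bit-cell area and, at the array level, longer wires.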
2.1 Technology modelling

A combination of Simulation Program with Integrated Circuit Emphasis (SPICE) and Cacti [10] simulations was used to develop the technology models in this analysis. SPICE was used for the bit-cell simulations, and these results were entered into Cacti for array modelling. The SPICE models developed for [8] were used to simulate MTJ switching energy and transistor sizing for the write pulse widths of interest. Retention times of 10 years as well as 1 year were simulated for a 20 nm predictive technology model [11]. A 6σ methodology is used in the SPICE models to eliminate defects from process variations, especially for STT sensing and write delay.

The bit-cell and transistor sizes from these simulations were then used in a modified Cacti to generate array energy and timing values for the various operations. Since Cacti does not support STT-MRAM modelling, we modified the original Cacti SRAM model to simulate the STT-MRAM MTJ cell by changing the cell width and aspect ratio to fit one transistor and one magnetic tunnel junction (1T1MTJ) and setting all cell leakage power to zero. This model treats the STT-MRAM like an SRAM cache with a smaller cell size and different cell dimensions. This approach creates a conservative estimate of the STT-MRAM cache array, since there is the potential to further optimise the STT-MRAM circuitry. We also evaluated the energy and timing values produced by the NVSim [12] non-volatile memory simulator and found that our approach produced slightly more conservative parameter values than NVSim, though the results were comparable. The bit-cell write energy from the SPICE simulation, multiplied by the cache line size, was then added to the dynamic write energy value from Cacti to estimate the STT-MRAM cache write energy. Since not every bit in a cache line would switch on a write operation, this gives a large, and thus conservative, estimate of the write energy. Table 1 lists the values gathered from the SPICE simulations. We observed that as the transistor size is made smaller, the bit-cell write energy increases, but the array-level dynamic read and write energy decreases. The total write energy trends are in different directions for L1 and L2 cache arrays, as shown in Fig. 3. L1 and L2 caches also have different read and write access patterns, with an L1 cache typically seeing a higher percentage of read operations, so the optimal design point can differ between the two. For the L1 cache, the optimal point is somewhere in the middle, around the 5 ns write region, since write energy grows quickly to the left and read energy grows to the right. However, since system performance is very sensitive to the L1 cache write pulse latency, a short write is still preferred. For the L2 cache, the trends are similar, so the optimal point lies in the long-write region. Previous work [1, 7] has shown that L2 write latency has little effect, so we pick 7 ns in this paper. The CMOS cache energy parameters were modelled directly with Cacti, as listed in Table 2.
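The short sketch below illustrates this write-energy bookkeeping. It assumes a 64-byte cache line (the line size is an assumption; it is not stated in the text), under which the per-line switching energy and total write energy reproduce the WrtCell and Wrt columns of Table 1 for the 64 kB array:

LINE_BYTES = 64                     # assumed cache line size (not stated in the text)
LINE_BITS = LINE_BYTES * 8          # 512 bits per line

def line_write_energy_pj(flip_energy_fj, write_access_pj):
    # Conservative estimate: every bit of the line is assumed to flip, so the
    # per-line switching energy is the per-bit flip energy times the line bits,
    # with the Cacti array access energy added on top.
    wrt_cell_pj = flip_energy_fj * LINE_BITS / 1000.0   # fJ -> pJ
    return wrt_cell_pj, wrt_cell_pj + write_access_pj

# 64 kB STT-MRAM array, MTJ transistor sizes 24/48/144 F^2 (inputs from Table 1)
for size_f2, flip_fj, access_pj in [(24, 508, 19.1), (48, 418, 22.7), (144, 378, 47.6)]:
    cell, total = line_write_energy_pj(flip_fj, access_pj)
    print(f"{size_f2:>3} F^2: WrtCell ~ {cell:.0f} pJ, Wrt ~ {total:.0f} pJ")
# -> 24 F^2 gives WrtCell ~ 260 pJ and Wrt ~ 279 pJ, matching Table 1

Running the sketch reproduces the 279, 237 and 241 pJ write energies listed for the 64 kB array, which is simply the conservative all-bits-flip accounting described above.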

Fig. 3 Different trends in read and write energies for MTJ cells used in L1 (left) and L2 (right) caches. The Write_total is a combination of Write_cell and
Write_access; Write_cell is the per-bit switching energy from SPICE and Write_access is the array access energy reported by Cacti

2.2 Architectural simulation

Complete architectural simulations were used for both the design space comparisons and the scalability analysis. The simulations were performed with the Sniper simulator [13] running the PARSEC [14] and Splash2 [15] benchmark suites. The two benchmark suites have fundamentally different properties: Splash2 focuses more on high-performance computing, while PARSEC includes a wide range of applications [16]. A combination of the two improves benchmark programme diversity. The Sniper simulator is based on an analytical model that estimates performance by analysing intervals. This model achieved a tenfold simulation speedup with relatively high accuracy [13]. The speedup gave us the ability to directly measure the complete benchmark execution time and avoid any sampling scheme, which is not ideal for multi-threaded benchmarks [17]. The simulator was modified to properly model cache write latency in all relevant operations. A range of cache hierarchies with various capacities, access latencies and write latencies was simulated, using the MESI cache coherence protocol with a strict inclusive policy. Table 3 lists the processor configurations used in the simulations. All are based on a four-wide out-of-order execution model running at 2 GHz. Structures such as the reorder buffer (ROB) were deliberately set to be on the large side to remove them as possible hidden bottlenecks in the simulations. The instruction cache was implemented as a 64 kB CMOS cache in all configurations to minimise its impact. All data are reported just for the parallel region-of-interest.

The number of processors was varied from 4 to 16. Typical two-level and three-level cache hierarchies were analysed, with a shared last-level cache and private cache(s) per core. A crossbar interconnect joined all cores. Table 4 lists the values of the system parameters simulated for each benchmark. The access latency in this table refers to the time until the cache access is completed or the requested data returned, which usually corresponds to the read latency. The write latency refers to the STT-MRAM write pulse delay. We assumed the access latency of the STT L1 cache is 25-50% longer, and that of the STT L2 cache 20% faster, than the corresponding CMOS cache, according to [18]. The longer sensing time of the STT-MRAM due to read disturbance significantly increases the access latency of a smaller cache, but the shorter interconnect delay makes it faster for larger caches. The L2 cache associativity of each configuration was larger than the sum of all L1 cache sets to avoid cache misses due to the inclusive policy. The large datasets of both suites were used for all simulations. The native datasets were used on a few benchmarks to show that the scalability observations are valid on real workloads.

Table 5 shows the main configurations used in the following graphs. The names in the left-hand column are how these configurations are referred to and labelled in the graphs. We picked a 7 ns write latency for the STT L2 and 5 ns for the STT L1 cache due to performance and energy concerns. The MTJ transistor sizes for the L2 and L1 are 24F² and 48F², as in Fig. 3, while a common 6T SRAM cell can be 135F² [18]. All stt configurations used an STT L2 cache. In a conservative manner, we estimated the STT L2 to have four times the size of the complementary metal-oxide-semiconductor (CMOS)-base L2, and the STT L1 cache to have two to four times the size of the CMOS L1 in the same footprint. Specifically, l1d2 means the STT L1 has twice the density (same footprint) of the CMOS L1, and l0z4 means the CMOS L0 size is 4 kB.

Table 1  Energy consumption parameters for the STT cache structures

Size                      64 kB            128 kB           256 kB           4 MB             16 MB  32 MB  64 MB
MTJ transistor size, F²   24    48    144  24    48    144  24    48    144  24    48    84   24     24     24
MTJ latency, ns           7     5     3    7     5     3    7     5     3    7     5     3.6  7      7      7
MTJ flip energy, fJ       508   418   378  508   418   378  508   418   378  531   432   397  531    531    531
Read, pJ                  15.6  17.8  30.1 18.6  21.4  31.0 24.3  28.4  48.4 152   190   223  255    406    766
WrtAccess, pJ             19.1  22.7  47.6 25.2  30.2  44.9 37.4  45.6  62.7 173   269   315  286    448    949
WrtCell, pJ               260   214   194  260   214   194  260   214   194  272   221   203  272    272    272
Wrt, pJ                   279   237   241  285   244   238  298   260   256  445   490   518  558    720    1221
Leakage, mW               1.7   1.9   2.6  2.6   2.8   3.3  4.4   4.6   6.7  64.2  98.2  126  232    560    1048

(The 64 kB-4 MB arrays are listed for three MTJ access-transistor sizes each; the 16-64 MB arrays use the 24 F² cell only.)

Table 2  Energy consumption parameters for the CMOS cache structures

Size          1 kB   4 kB   32 kB   64 kB   4 MB   8 MB   16 MB
Read, pJ      6.9    7.2    28.3    35.0    293    520    1003
Wrt, pJ       7.6    10.7   30.6    40.2    344    512    1114
Leakage, mW   0.76   1.33   13.45   28.2    736    1440   2984

Table 3  Simulated processor configurations

Parameter     Values
pipeline      four-wide, out-of-order
L1 ICache     64 kB, 2-way
ROB entries   128
memory        45 ns latency, 7.6 GB/s bandwidth

Table 4  Simulated STT-MRAM cache parameters. For writes, the access latency is added to the write latency

Parameter                 Values
access latency (CMOS)     L0 1 or 3, L1 4, L2 30 cycles
access latency (STT)      L1 5 or 6, L2 24 cycles
write latency (STT)       L1 5 ns (optionally 3 or 7 ns), L2 7 ns
L0 DCache                 1 kB or 4 kB, fully associative, private
L1 DCache                 64 kB, 128 kB or 256 kB, 4-way associative, private
L2 cache (four cores)     4 MB or 16 MB, 16-way associative, eight banks
L2 cache (eight cores)    8 MB or 32 MB, 32-way associative, eight banks
L2 cache (16 cores)       16 MB or 64 MB, 64-way associative, eight banks
Fig. 4 (a), (b) Compare the CMOS-base (64K, 4M) with STT hierarchies (128K, 16M); the write latency of the L1 cache results in a significant performance drop.
(c), (d) Compare the CMOS-base with STT hierarchies that use the write-merging L0; the hierarchy uses a 4K fully associative L0 cache and an STT L1 cache
with various write latencies
(a) PARSEC benchmark suite, (b) Splash2 benchmark suite, (c) PARSEC benchmark suite (L0 implemented), (d) Splash2 benchmark suite (L0 implemented)

3 Performance, energy and scalability

Performance, energy consumption, energy-delay product (EDP) and scalability of the cache hierarchy are the primary metrics of interest. Performance here means the total execution time. Energy refers to the overall central processing unit (CPU) cache energy consumption, not that of the entire CPU. Scalability refers to the speedup obtained with different numbers of cores. In this section, we examine the experimental results of our proposed STT-MRAM cache hierarchy compared with a CMOS baseline. The comparisons of performance and energy are based on a fixed number of cores across the different hierarchies. We then explore the scalability of the proposed hierarchies from 4 to 16 cores to investigate the STT-MRAM impact. Finally, we show the performance and energy impact of implementing a larger L0 cache.

3.1 Performance

The challenge of using STT-MRAM for the lower-level caches (closer to the CPU) is to overcome the added write latency and dynamic write energy. The reward is increased density and significantly reduced leakage power, which comprises the majority of the power consumed in a CMOS cache hierarchy. Long write latency at a first-level cache creates a bandwidth mismatch between the processor pipeline and the cache. Queueing and buffering can only absorb a certain amount of write data before the processor must stall if the cache system cannot keep up with the offered load. Figs. 4a and b show the impact of write latency on performance when the CMOS L1 cache is replaced by STT-MRAM. The extra capacity in the L1 made possible by the higher density of STT-MRAM cannot compensate for the reduction in write bandwidth seen by the processor.

A method to match the bandwidth between the CPU and the L1 cache is needed that does not also increase cache energy consumption significantly. Augmenting the STT-MRAM L1 with a small, fully associative CMOS L0 cache was investigated in [9] and found to be an effective method to restore the performance lost to the higher write latency. The L0 cache acts as a write-merging buffer, translating single-word writes from the CPU into cache line writes to the L1. If enough of the processor traffic is handled by this L0, performance can be restored. We have implemented this structure as a standard write-back cache, so it uses a standard cache controller with none of the extra functionality that a hybrid cache or a low retention-time cache would require. By keeping the L0 as small as possible, access time and leakage power are kept low. Figs. 4c and d demonstrate the improvement from using the L0. When the L0 is 4 kB, the average miss rate can be as low as 2% (Fig. 5). The performance of the STT-MRAM two-level hierarchy differs by almost 50% among the 3, 5 and 7 ns write latencies, while the performance difference of the three-level hierarchy with the L0 implemented shrinks to 15%. The three-level hierarchy performs 40% better on average than the two-level hierarchy with a 5 ns write latency.
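To make the write-merging behaviour concrete, the following minimal behavioural sketch (purely illustrative; it is not the Sniper cache model used for the results above, and the allocation and LRU details are assumptions) shows how a small fully associative write-back L0 turns many single-word CPU stores into far fewer whole-line write-backs to the STT-MRAM L1:

from collections import OrderedDict

LINE_BYTES = 64                         # assumed cache line size

class WriteMergingL0:
    # Small fully associative, write-back, LRU cache acting as a write-merging buffer.
    def __init__(self, size_bytes=4096):
        self.num_lines = size_bytes // LINE_BYTES
        self.lines = OrderedDict()      # line address -> dirty flag, kept in LRU order
        self.cpu_stores = 0             # word-sized stores arriving from the CPU
        self.l1_line_writes = 0         # whole-line write-backs sent to the STT L1

    def store(self, byte_addr):
        # One word store from the CPU; allocate on write, write back on eviction.
        self.cpu_stores += 1
        line = byte_addr // LINE_BYTES
        if line in self.lines:
            self.lines.move_to_end(line)            # merge into the already-dirty line
        else:
            if len(self.lines) >= self.num_lines:   # evict the least recently used line
                _victim, dirty = self.lines.popitem(last=False)
                if dirty:
                    self.l1_line_writes += 1        # one line write absorbs many stores
            self.lines[line] = True                 # newly allocated line is dirty

# Streaming example: 8-byte stores covering 256 consecutive cache lines
l0 = WriteMergingL0(size_bytes=4 * 1024)
for addr in range(0, 256 * LINE_BYTES, 8):
    l0.store(addr)
print(l0.cpu_stores, "CPU word stores ->", l0.l1_line_writes, "line writes to the L1")

In this toy run the 4 kB L0 turns 2048 word stores into 192 line write-backs (the 64 lines still resident would eventually be written back as well), which is the bandwidth-matching effect described above.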
Table 5  Simulated cache hierarchies

Configuration name   L0 Sz, K   L1 Sz     L2 Sz
CMOS-base            —          CMOS 64   CMOS 4 MB/8 MB/16 MB
stt-l2               —          CMOS 64   STT 16 MB/32 MB/64 MB
stt-l1d2             —          STT 128   STT 16 MB/32 MB/64 MB
stt-l1d4             —          STT 256   STT 16 MB/32 MB/64 MB
stt-l0z1             CMOS 1     STT 128   STT 16 MB/32 MB/64 MB
stt-l0z4             CMOS 4     STT 256   STT 16 MB/32 MB/64 MB

3.2 Energy comparison

By reconfiguring the cache hierarchy to contain as little CMOS circuitry as possible, leakage power is reduced significantly. With the L0 write-merging cache as the only all-CMOS structure, the L1 and lower-level caches can be configured for larger capacity in a given area. We have simulated several different combinations of cache size at all three levels. Figs. 6a and b show the potential energy savings with this three-level configuration. All graphs in this section use the simlarge dataset to stress cache capacity as much as possible in simulation. SRAM L2 cache leakage and SRAM L1 cache leakage energies account for 80 and 10%, respectively, of the total cache energy consumption on average.
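The total-energy results in this section are consistent with a standard per-level accounting that combines the leakage powers and per-access energies of Tables 1 and 2 with the access counts and execution time from simulation; the paper does not spell out the exact bookkeeping, so the following form is an assumption used here only to make the comparison concrete:

\[
E_{\mathrm{cache}} \;=\; \sum_{\ell \in \{\mathrm{L0,\,L1,\,L2}\}}
\left( P^{\mathrm{leak}}_{\ell}\, T_{\mathrm{exec}}
\;+\; N^{\mathrm{rd}}_{\ell}\, E^{\mathrm{rd}}_{\ell}
\;+\; N^{\mathrm{wr}}_{\ell}\, E^{\mathrm{wr}}_{\ell} \right).
\]

Replacing the SRAM L2 mainly attacks the dominant leakage term, while the L0 attacks the dynamic write term of the STT L1.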
use drops significantly after adopting the STT L2 cache (stt-l2) by

Fig. 5 Near core cache miss rate. Though the 1kB fully associative L0 cache does have a large cache miss rate, the 4kB cache on average is <5%
(a) PARSEC benchmark suite, (b) Splash2 benchmark suite

Fig. 6 (a), (b) Shows the total energy consumption normalised to CMOS-base (64K,4M) with stt-l2 (64K, 16M), stt-l1d2 (128K, 16M) and stt-l0 with
varying L0 sizes (1/4K, 128K, 16M). The CMOS3 L2 leakage is computed for a 4MB STT-MRAM cache to create a fair baseline. (c), (d) Shows the dynamic
energy consumption with the same configurations as in (a), (b)
(a) PARSEC benchmark suite (total energy), (b) Splash2 benchmark suite (total energy), (c) PARSEC benchmark suite (dynamic energy), (d) Splash2 benchmark suite (dynamic
energy)

The total energy use drops significantly, by almost 60%, after adopting the STT L2 cache (stt-l2). It drops by a further 10% with the 4 kB L0 STT configuration (stt-l0z4). The total energy savings with the small L0 implementation average 70% for both the PARSEC and Splash2 benchmark suites. Figs. 6c and d show the dynamic energy consumption of the various STT hierarchies, omitting the L0, L1 and L2 leakages. Different L0 and L1 sizes are shown. The large cross-hatched segment in the middle of the 1 kB L0 (stt-l0z1) bars represents cache line writes to the L1. With the smaller L0, there are a large number of L0 write-backs of modified data to the L1. This segment decreases rapidly as the L0 size increases.

3.3 Energy-delay product

Though the tradeoff between energy use and performance is common in system design, improving both at the same time rarely happens. To show the overall merit of the hierarchy design, we use the EDP as a metric that highlights the most balanced architecture between energy efficiency and performance [19]. Figs. 7a and b show the EDP of a system with four cores. On average, the 4 kB L0 (stt-l0z4) shows a significant 65% EDP reduction over the baseline (CMOS-base) for both benchmark suites. The 4 kB L0 (stt-l0z4) also shows a 25% reduction over the configuration that only replaced the CMOS L2 with STT (stt-l2), which is approximately an additional 10% reduction over the CMOS baseline. Figs. 7c and d show the EDP of a system with 16 cores; the observations for four cores still hold for the 16-core system. In summary, the 4 kB L0 hierarchy achieves, for both benchmark suites, an average 65% EDP reduction over the CMOS baseline and a 25% EDP reduction over the configuration with only an STT-MRAM L2. The first, 65%, EDP reduction comes mainly from the energy reduction, since the implemented STT L2 has little impact on benchmark performance owing to the less frequent L2 write accesses. Moreover, since the L1 SRAM cache leakage energy grows to 25% of the total once the STT L2 cache is implemented, a further significant energy reduction could potentially be achieved by switching the SRAM L1 to an STT L1 as well. However, due to the long write latency and the much higher write access frequency in the L1, such a direct replacement results in much slower execution and larger dynamic energy use, as shown in Figs. 6c and d. By implementing the small L0, a significant number of CPU writes are absorbed, so the write access frequency to the STT L1 becomes small, further reducing dynamic energy use and improving performance. In addition, the small L0 can potentially provide faster CPU-side cache access than the STT L1, which has a longer sensing time. The 25% EDP reduction over the STT L2 implementation is thus achieved by this small L0 scheme.
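The EDP figures above follow the usual definition from [19], energy multiplied by execution time, normalised to the CMOS baseline as plotted in Fig. 7:

\[
\mathrm{EDP} \;=\; E_{\mathrm{cache}} \times T_{\mathrm{exec}},
\qquad
\mathrm{EDP}_{\mathrm{norm}} \;=\; \frac{E\,T}{E_{\mathrm{CMOS\text{-}base}}\; T_{\mathrm{CMOS\text{-}base}}},
\]

so a configuration that both saves energy and runs faster compounds the two gains.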
3.4 Scalability

To analyse the scalability of the workloads, we have used full simulation runs with the large and native simulation datasets. Fig. 8 shows the scalability from 4 to 16 cores for both the PARSEC and Splash2 benchmark suites. Only the slopes of the lines are compared here among the different configurations. With the simlarge dataset, both suites show that the STT-MRAM does not significantly impact the scalability relative to the CMOS-base.

Fig. 7 (a), (b) Shows the EDP of various STT hierarchies with four cores. (c), (d) Shows the EDP with 16 cores. All normalised to CMOS-base
(a) PARSEC benchmark suite (four cores), (b) Splash2 benchmark suite (four cores), (c) PARSEC benchmark suite (16 cores), (d) Splash2 benchmark suite (16 cores)

Canneal and facesim from PARSEC show good scalability compared with the other benchmarks. According to Bhadauria et al. [20], canneal was limited primarily by memory latency rather than bandwidth due to low data reuse. However, this observation was made with a relatively small lower-level cache, while Tables 6 and 7 show that a larger STT cache gives better cache reuse for both the simlarge and simnative input datasets. The simlarge and simnative input datasets of canneal are 256 MB and 2 GB, respectively. Since it is possible that the working set of simlarge can fit in a 64 MB STT L2 but simnative cannot, we further investigate the simnative input dataset of canneal to see whether the larger-L2 impact persists. The scalability of the simlarge input dataset reaches a maximum at 32 MB, while the scalability of the simnative input dataset increases with the L2 cache capacity. This means that with a real workload, a larger L2 cache (up to 128 MB) could improve the scalability of canneal. We observed poor scalability for swaptions from PARSEC and raytrace from Splash2. Freqmine only runs single-threaded due to the lack of Open Multi-Processing support in the Sniper simulator. In general, the STT-MRAM hierarchy with the 4K L0 (stt-l0z4) scales as well as the CMOS-base, with a few cases, including facesim, canneal and cholesky, showing significant scalability improvements.

3.5 Larger L0 impact

It is clear that the implementation of this small fully associative L0 achieves a good tradeoff between performance and energy use. We further investigated a larger L0 to see whether a better EDP can be achieved. A 32 kB 4-way associative and a 64 kB 8-way associative L0 were implemented in the same hierarchy, as shown in Figs. 9 and 10. With such big caches, full associativity becomes too costly in dynamic energy consumption and access delay. Fig. 9 shows an average 10% performance improvement, but 20-40% more energy use after adopting these larger L0 caches. The overall EDP increases by more than 20% in both benchmark suites in Fig. 10. Since the 4 kB L0 already has a low average miss rate of 2% in Fig. 5, there is little room for performance improvement by simply increasing the L0 size. However, the static leakage and dynamic energy increase significantly. The 4K L0 remains the better design point to trade off performance and energy use.

4 Related work

Researchers have proposed implementing an STT-MRAM L1 cache to take advantage of its larger capacity and significantly smaller leakage power in [5, 6, 21]. In [6], a detailed evaluation flow for the emerging non-volatile memory technologies, including STT-MRAM, was described to explore the next-generation memory hierarchy. The feasibility of STT L1 data and instruction cache implementations was evaluated, and a performance drop due to the STT L1 data cache was observed. Guo et al. [5] investigated an STT-MRAM L1 implementation in a single-issue in-order eight-core system. A larger STT L1 in the same area provided better performance, but larger total power due to the CMOS peripheral circuitry overhead. Li et al. [21] implemented a one-level STT-MRAM cache in a simple embedded system with an in-order single-core configuration. They proposed a compiler-assisted refresh scheme for the implemented volatile STT-MRAM, which significantly reduced the refresh frequency, minimising the dynamic refresh energy. These studies mainly focused on evaluating the implementation of STT caches in simple in-order CPUs rather than high-performance computing platforms.

To address the write power and latency problems, researchers have proposed several techniques: decreasing the retention time [2, 4, 7], modifying the cache hierarchy to use a mix of structures with different properties [1, 4, 22, 23], implementing policies to limit write operations to high-power structures [3, 24-28], and using hybrid cache architectures [24, 29, 30]. Decreasing the retention time trades reliability for device area and energy at the device level, while cache policies optimise energy consumption at the system level. Both achieve significant energy reductions and comparatively modest performance improvements, but require either additional logic or changes to the cache control scheme.

Reduced retention time: Retention time can potentially be reduced in caches by reducing the MTJ volume, since the lifetime of a cache line can be much shorter than the typical 10 years. This would allow a reduction of the MTJ write current. Reduced retention time was proposed and analysed in [2] for on-chip caches on a single-core chip. They proposed an SRAM L1 cache with reduced retention-time STT-MRAM L2 and L3 cache hierarchy designs, which showed an energy reduction of 70%, but at a small performance loss. To ensure the reduced retention-time STT-MRAM is reliable, they further proposed a refresh scheme similar to dynamic RAM refresh technology. Optimal retention times were studied in [7] for the last-level cache, settling on a retention time of about 10 ms after detailed application profiling for CMPs.

Fig. 8 Scalability of various cache hierarchies from 4 to 16 cores, including the two-level CMOS-base (64K, 4/8/16M), stt-l2 (64K, 16/32/64M) and
stt-l1d2 (128K, 16/32/64M), and the three-level stt-l0z1 (1K, 128K, 16/32/64M) and stt-l0z4 (4K, 128K, 16/32/64M). The three-level hierarchy has scalability
similar to the two-level CMOS-base, but canneal and facesim in particular show better results than the others
(a) PARSEC benchmark suite, (b) Splash2 benchmark suite

They proposed a victim-cache structure to handle those cache lines that exceeded their corresponding retention time, and achieved 18 and 60% improvements in performance and energy, respectively. However, because a bit-flip could happen at any time during the life of a cache line, an error correction coding scheme should be introduced to the data checking procedure before refreshing [31].

Though reducing retention time can potentially decrease cache line write energy and latency, extra error handling units must be added to maintain cache reliability. This scheme takes an orthogonal approach to reducing STT-MRAM dynamic energy compared with our hierarchy scheme, leaving an opportunity to combine the two in the future.

Hybrid schemes: Implementation of STT-MRAM across the entire cache hierarchy, including the L1 cache, was considered in [4]. The work implemented low-retention devices in the L1 cache with a dynamic refresh scheme and further proposed a mixture of retention times in the last-level cache. By using a data migration scheme, read-intensive data and write-intensive data can be allocated to different retention-time regions, which gives a 6.2% performance improvement and a 40% energy improvement over the single-level relaxed retention scheme. A read-write aware hybrid cache hierarchy was presented in [29], where the cache is divided into a write cache section based on SRAM and a read section based on non-volatile memories including STT-MRAM. They suggest an intra-cache data movement policy which produces an overall power reduction of up to 55% in addition to a 5% performance improvement over the baseline SRAM L2 and L3 caches. A novel management policy using a hybrid cache design was shown in [30] that aims at improving cache lifetime by reducing write pressure on STT-MRAM caches. They show a 50% reduction in power for an L2 shared cache along with a substantial improvement in cache lifetime.

Hybrid schemes require complex control units to dispatch requests to different cache devices. Our scheme, which only implements a standard cache level, can avoid these extra control units. This keeps our design simple and straightforward, thus leveraging existing schemes.

The idea of a small, fully associative cache was first proposed in [32] to remove mapping conflict misses in a direct-mapped cache by putting it in the refill path.

Table 6  Cache capacity impact on performance and cache reuse of canneal with the simlarge dataset. The execution time of each configuration is normalised to four cores with a 4 MB L2 cache. The percentage is the L2 cache miss rate

Cores   4 MB      8 MB          16 MB         32 MB         64 MB          128 MB
4       1 (43%)   0.84 (38%)    0.68 (31%)    0.61 (25%)    0.477 (20%)    0.476 (18%)
8       —         0.477 (38%)   0.41 (31%)    0.32 (22%)    0.30 (16%)     0.29 (12%)
16      —         —             0.36 (31%)    0.19 (22%)    0.16 (12%)     0.15 (8%)

Table 7  Cache capacity impact on performance and cache reuse of canneal with the simnative dataset. The execution time of each configuration is normalised to four cores with a 4 MB L2 cache. The percentage is the L2 cache miss rate

Cores   4 MB      8 MB         16 MB        32 MB        64 MB        128 MB
4       1 (94%)   0.94 (90%)   0.84 (83%)   0.73 (70%)   0.58 (53%)   0.49 (35%)
8       —         0.67 (90%)   0.61 (82%)   0.51 (70%)   0.38 (53%)   0.27 (35%)
16      —         —            0.55 (83%)   0.47 (70%)   0.35 (53%)   0.22 (34%)

(— indicates combinations for which no result is listed.)
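Since every entry in Tables 6 and 7 is an execution time normalised to the four-core, 4 MB configuration, the speedup of any configuration relative to that reference can be read directly from the tables (a worked reading of the data above, not an additional measurement):

\[
S(N,\,C) \;=\; \frac{T_{4\,\mathrm{cores},\,4\,\mathrm{MB}}}{T_{N\,\mathrm{cores},\,C}},
\qquad\text{e.g.}\quad
S(4,\,128\,\mathrm{MB}) \;=\; \frac{1}{0.49} \;\approx\; 2.0
\]

for canneal with the simnative dataset, i.e. extra L2 capacity alone roughly halves the execution time at a fixed core count.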

Fig. 9 Figure shows the performance and energy use of several configurations including CMOS baseline, stt-l1d2, stt-l0z1, to stt-l0z64, where stt-l0z32 and
stt-l0z64 have 32kB 4-way assoc and 64kB 8-way assoc L0 cache. The average performance improves <10% and the energy use increases more than 25%
from stt-l0z4 to stt-l0z64
(a) PARSEC performance, (b) Splash2 performance, (c) PARSEC energy, (d) Splash2 energy

Fig. 10 Figure shows the EDP of several configurations including the larger L0 implementations. The overall EDP increases 20% on average for both
benchmark suites
(a) PARSEC EDP, (b) Splash2 EDP

To reduce microprocessor energy use, Kin et al. [33] proposed a small direct-mapped cache as a filter cache on the core side that achieved almost 60% energy reduction with a 20% performance drop. In [24], a read-preemptive write buffer of 20 entries on the memory side was proposed to reduce read stalls to the STT-MRAM L2 cache during long write operations by implementing rules that favour read operations. The write requests to an STT-MRAM L1 cache, however, mainly come from the CPU, which has a much higher demand for CPU stores than for memory cache line fills. Also, the STT-MRAM L1 suffers from longer read access latency due to the longer MTJ sensing time. In this paper, the small fully associative cache is placed on the core side. This is first to improve the bandwidth of data flowing to the STT-MRAM L1 cache by merging processor writes into cache line writes, similar to the write aggregation schemes in [34, 35], and also to provide overall faster cache access due to its simplicity and small capacity.

5 Conclusion

We have analysed the impact of STT-MRAM as a replacement for CMOS at all levels of a multiprocessor cache hierarchy. Though STT-MRAM has higher write energy and latency, reducing these parameters at the circuit level does not lead to an optimal design. The extra circuit area required to minimise MTJ bit-cell write time and energy causes the cache arrays to grow, leading to higher read energy and latency due to parasitic effects.

A fully associative L0 cache as small as 4 kB can effectively restore the performance lost to the higher write latency. This structure hides the extra write latency of around 5 ns when running at 2 GHz, giving total cache energy savings of 40-70% and an average EDP reduction of 60% compared with the CMOS baseline. The L0 cache is implemented as a standard cache level, requiring no additional control structures. We have observed no significant scalability impact from using STT-MRAM with the L0 implemented. A few benchmarks show improved scalability up to 16 cores using the STT-MRAM hierarchy.

The introduction of new memory technologies can have significant impacts on the best architectural choices for the memory hierarchy of a multicore system. This paper shows that simple solutions can help mitigate the negative impacts while still allowing the system to take advantage of the benefits of the new technology.

6 References

[1] Park, S.P., Gupta, S., Mojumder, N., et al.: Future cache design using STT MRAMs for improved energy efficiency: devices, circuits and architecture. DAC '12: Proc. of the 49th Annual Design Automation Conf., 2012
[2] Smullen, C.W.I., Mohan, V., Nigam, A., et al.: Relaxing non-volatility for fast and energy-efficient STT-RAM caches. 2011 IEEE 17th Int. Symp. on High Performance Computer Architecture (HPCA), 2011, pp. 50-61
[3] Rasquinha, M., Choudhary, D., Chatterjee, S., et al.: An energy efficient cache design using spin torque transfer (STT) RAM. ISLPED '10: Proc. of the 16th ACM/IEEE Int. Symp. on Low Power Electronics and Design, 2010
[4] Sun, Z., Bi, X., Li, H.H., et al.: Multi retention level STT-RAM cache designs with a dynamic refresh scheme. MICRO-44 '11: Proc. of the 44th Annual IEEE/ACM Int. Symp. on Microarchitecture, 2011
[5] Guo, X., Ipek, E., Soyata, T.: Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing. ISCA '10: Proc. of the 37th Annual Int. Symp. on Computer Architecture, 2010
[6] Senni, S., Torres, L., Sassatelli, G., et al.: Emerging non-volatile memory technologies exploration flow for processor architecture. 2015 IEEE Computer Society Annual Symp. on VLSI (ISVLSI), 2015, p. 460
[7] Jog, A., Mishra, A.K., Xu, C., et al.: Cache revive: architecting volatile STT-RAM caches for enhanced performance in CMPs. DAC '12: Proc. of the 49th Annual Design Automation Conf., 2012, pp. 243-252
[8] Kim, J., Zhao, H., Jiang, Y., et al.: Scaling analysis of in-plane and perpendicular anisotropy magnetic tunnel junctions using a physics-based model. Device Research Conf. (DRC), 2014
[9] Tuohy, W., Ma, C., Nandkar, P., et al.: Improving energy and performance with spintronics caches in multicore systems. Euro-Par '14: OMHI Third Annual Workshop on On-Chip Memory Hierarchies and Interconnects, 2014
[10] Hewlett-Packard Development Company, L.P.: Cacti 6.5, 2009. Available at http://www.hpl.hp.com/research/cacti/
[11] Zhao, W., Cao, Y.: New generation of predictive technology model for sub-45 nm design exploration. Seventh Int. Symp. on Quality Electronic Design (ISQED '06), 2006, p. 6
[12] Dong, X., Xu, C., Xie, Y., et al.: NVSim: a circuit-level performance, energy, and area model for emerging nonvolatile memory, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2012, 31, (7), pp. 994-1007
[13] Genbrugge, D., Eyerman, S., Eeckhout, L.: Interval simulation: raising the level of abstraction in architectural simulation. 2010 IEEE 16th Int. Symp. on High Performance Computer Architecture (HPCA), 2010, pp. 1-12. Available at http://www.dx.doi.org/10.1109/hpca.2010.5416636
[14] Bienia, C.: Benchmarking modern multiprocessors. PhD thesis, Princeton University, January 2011
[15] Woo, S.C., Ohara, M., Torrie, E., et al.: The SPLASH-2 programs: characterization and methodological considerations. Proc. of the 22nd Annual Int. Symp. on Computer Architecture (ISCA '95), New York, NY, USA, 1995, pp. 24-36. Available at http://www.doi.acm.org/10.1145/223982.223990
[16] Bienia, C., Kumar, S., Li, K.: PARSEC vs. SPLASH-2: a quantitative comparison of two multithreaded benchmark suites on chip-multiprocessors. IEEE Int. Symp. on Workload Characterization (IISWC 2008), 2008, pp. 47-56
[17] Alameldeen, A.R., Wood, D.A.: IPC considered harmful for multiprocessor workloads, IEEE Micro, 2006, 26, (4), pp. 8-17
[18] Chun, K.C., Zhao, H., Harms, J.D., et al.: A scaling roadmap and performance evaluation of in-plane and perpendicular MTJ based STT-MRAMs for high-density cache memory, IEEE J. Solid-State Circuits, 2013, 48, (2), pp. 598-610
[19] Gonzales, R., Horowitz, M.: Energy dissipation in general purpose processors, IEEE J. Solid-State Circuits, 1995, 31, pp. 1277-1284
[20] Bhadauria, M., Weaver, V.M., McKee, S.A.: Understanding PARSEC performance on contemporary CMPs. IEEE Int. Symp. on Workload Characterization (IISWC 2009), 2009, pp. 98-107
[21] Li, Q., Li, J., Shi, L., et al.: Compiler-assisted refresh minimization for volatile STT-RAM cache. 2013 18th Asia and South Pacific Design Automation Conf. (ASP-DAC), 2013, pp. 273-278
[22] Xu, W., Sun, H., Wang, X., et al.: Design of last-level on-chip cache using spin-torque transfer RAM (STT RAM), IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2011, 19, (3), pp. 483-493
[23] Kim, Y., Gupta, S.K., Park, S.P., et al.: Write-optimized reliable design of STT MRAM. ISLPED '12: Proc. of the 2012 ACM/IEEE Int. Symp. on Low Power Electronics and Design, 2012
[24] Sun, G., Dong, X., Xie, Y., et al.: A novel architecture of the 3D stacked MRAM L2 cache for CMPs. IEEE 15th Int. Symp. on High Performance Computer Architecture (HPCA 2009), 2009, pp. 239-249
[25] Zhou, P., Zhao, B., Yang, J., et al.: Energy reduction for STT-RAM using early write termination. ICCAD '09: Proc. of the 2009 Int. Conf. on Computer-Aided Design, 2009
[26] Kwon, K.-W., Choday, S.H., Kim, Y., et al.: AWARE (asymmetric write architecture with REdundant blocks): a high write speed STT-MRAM cache architecture, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2013, 22, (4), pp. 712-720
[27] Sun, Z., Li, H., Wu, W.: A dual-mode architecture for fast-switching STT-RAM. ISLPED '12: Proc. of the 2012 ACM/IEEE Int. Symp. on Low Power Electronics and Design, 2012
[28] Ahn, J., Yoo, S., Choi, K.: DASCA: dead write prediction assisted STT-RAM cache architecture. 2014 IEEE 20th Int. Symp. on High Performance Computer Architecture (HPCA 2014), February 2014
[29] Wu, X., Li, J., Zhang, L., et al.: Power and performance of read-write aware hybrid caches with non-volatile memories. Design, Automation & Test in Europe Conf. & Exhibition (DATE '09), 2009, pp. 737-742
[30] Jadidi, A., Arjomand, M., Sarbazi-Azad, H.: High-endurance and performance-efficient design of hybrid cache architectures through adaptive line replacement. ISLPED '11: Proc. of the 17th IEEE/ACM Int. Symp. on Low-power Electronics and Design, 2011
[31] Del Bel, B., Kim, J., Kim, C., et al.: Improving STT-MRAM density through multibit error correction. Design, Automation and Test in Europe Conf. and Exhibition (DATE), 2014, pp. 1-6
[32] Jouppi, N.P.: Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers, ACM SIGARCH Comput. Archit. News, 1990, 18, pp. 364-373
[33] Kin, J., Gupta, M., Mangione-Smith, W.H.: The filter cache: an energy efficient memory structure. Proc. of the 30th Annual ACM/IEEE Int. Symp. on Microarchitecture (MICRO 30), Washington, DC, USA, 1997, pp. 184-193. Available at http://www.dl.acm.org/citation.cfm?id=266800.266818
[34] Varma, A., Jacobson, Q.: Destage algorithms for disk arrays with non-volatile caches. 22nd Annual Int. Symp. on Computer Architecture, 1995, pp. 83-95
[35] Gill, B.S., Modha, D.S.: WOW: wise ordering for writes combining spatial and temporal locality in non-volatile caches. Proc. of the Fourth USENIX Conf. on File and Storage Technologies (FAST '05), Berkeley, CA, USA, 2005, p. 10
