
2017 22nd IEEE European Test Symposium (ETS)

Exploiting STT-MRAM for Approximate Computing

Nour Sayed, Fabian Oboril, Azadeh Shirvanian, Rajendra Bishnoi, Mehdi B. Tahoori
Karlsruhe Institute of Technology (KIT) Karlsruhe, Germany
E-mail: {nour.sayed, fabian.oboril, azadeh.shirvanian, rajendra.bishnoi, mehdi.tahoori}@kit.edu

978-1-5090-5457-2/17/$31.00 ©2017 IEEE

Abstract—Spin Transfer Torque Magnetic RAM (STT-MRAM) is an emerging non-volatile memory technology and a potential candidate to replace SRAM in processor caches. However, STT-MRAM suffers from high write latency and high write energy consumption, which have to be addressed for energy-efficient on-chip caches. The non-volatility property of STT-MRAM can be relaxed by reducing the thermal stability factor to improve both the write latency and write energy of STT-MRAM. However, this leads to an increase in retention failure and read disturb rates, resulting in erroneous data stored in the cache. This problem can naturally be mitigated in the scope of approximate computing, in which such errors can be tolerated at the application level. In this paper, we show how STT-MRAM technology can effectively be used for approximate computing by tuning technology and application parameters to achieve an acceptable level of correctness with significant gains. Results show that using our proposed approximate computing framework, the per-access write latency and energy can be improved by up to 25% and 70%, respectively.

I. INTRODUCTION

Energy consumption has emerged as a major design constraint for modern integrated circuits, in particular in low-power application domains such as the “Internet of Things”. The static power, which is the dominating component of the total energy consumption, is the leading roadblock in these fields [1]. Therefore, the stringent energy targets cannot be achieved using conventional SRAMs because of their high leakage. As pointed out by the International Technology Roadmap for Semiconductors [2], one of the most promising solutions to overcome this power challenge is the integration of non-volatile memories, where the data can be retained without any power supply.

Among the emerging non-volatile memory technologies, Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) is one of the most promising technologies for future embedded memories. STT-MRAM is a novel and unique storage technology based on a magnetic device (called Magnetic Tunnel Junction or simply MTJ) that exploits not only the charge of electrons but also their spin to store digital content. As a result, STT-MRAM promises almost zero static power, fast access latencies (almost as fast as SRAM) and high integration density (as good as DRAM) [3, 4]. Thus, STT-MRAM has the potential to replace SRAM in the memory hierarchy of the computing system.

However, in order to employ STT-MRAM as a promising low-power solution, several fundamental challenges still have to be resolved. In particular, it is of decisive importance to reduce the write energy as well as the write latency of STT-MRAM [1, 2]. In fact, the MTJ cells require a rather high current for a long duration to switch their magnetization. Additionally, the stochastic switching nature of the MTJ cells necessitates a longer timing margin, which further extends the total write period. The overall write latency is significantly increased by this long write period, and because the write current flows continuously for the same duration, the write energy is high as well [5].

Combining fast writing speed and low write energy is not yet trivial. Consequently, there are a variety of different approaches at circuit- and architecture-level to improve the characteristics of STT-MRAM [6–10]. An interesting solution in that regard is to relax the non-volatility property of STT-MRAM, as this improves both writing speed and energy. On the other hand, this increases the error rate due to retention failures as well as read disturb [11].

In this paper, we show the potential of the STT-MRAM technology in the scope of approximate computing, which naturally fits the aforementioned constraints. In general, approximate computing relaxes the bounds of accurate computing for applications which are inherently error resilient and, in return, improves the efficiency of the system, for example in terms of performance and energy [12, 13]. Many applications, for instance in the audio or image processing domains, possess an intrinsic fault tolerance, and thus can deal with noisy (i.e. erroneous) data. For example, when the output of multimedia (images, videos, etc.) processing algorithms is left to human perception, strict exactness may not be required and imprecise results may often be sufficient. Hence, with approximate computing, an increased error rate can be tolerated to improve the write energy/latency of STT-MRAM with minimal cost. Moreover, even the overall system energy efficiency can further be improved due to shorter execution times thanks to the improved cache access latencies.

Our simulation results show that using the proposed approximate computing framework, the per-access write energy of STT-MRAM can be reduced by 70 %. Applying this to the data-cache of a microprocessor improves the overall system performance by 10 % and 25 % for 2 GHz and 1 GHz core frequencies, respectively.

The remainder of this paper is organized as follows. In Section II, the basics of STT-MRAM are presented and the idea of approximate computing is introduced. Afterwards, in Section III, the proposed framework using approximate computing for STT-MRAM is explained, followed by a comprehensive experimental study in Section IV. Finally, Section V concludes the paper.

II. PRELIMINARIES

A. Basics of STT-MRAM

In STT-MRAM, data is stored in a Magnetic Tunnel Junction (MTJ) cell, which consists of two independent ferromagnetic layers separated by a thin oxide layer. The magnetic orientation of one layer, the free layer, can be freely rotated, while the magnetization of the other layer, the reference layer, is fixed. Thus, the magnetization of the free layer can be in parallel or anti-parallel to the reference layer. As a result, the electric resistance of the MTJ changes to high for anti-parallel


and low for parallel. The low- and high-resistance states are used to represent a logic ’0’ and ’1’, respectively (see Fig. 1).

Fig. 1: Typical STT-MRAM bit-cell structure (free layer, MgO barrier and reference layer between bit-line and source-line, selected by the word-line; parallel state: low resistance; anti-parallel state: high resistance; write-0 and write-1 use opposite current directions with IW > IC)

In order to read data from an STT-MRAM cell, a low current flows through the MTJ to sense the resistance state of the cell. To write data into an STT-MRAM cell, current also flows through the MTJ. However, the write current (Iw) is much higher (tens of µA) than the critical current (Ic), i.e., the minimum current that has to flow for a considerable amount of time to switch the magnetization at a certain write error rate. The final magnetization can be controlled by the write current direction (see Fig. 1). As a result of this high write current, the dynamic write energy in STT-MRAM is very high [5]. This issue is further pronounced due to the stochastic nature of the writing (switching) process as well as the high sensitivity to process variation, which requires considerable timing margins [7].

B. Retention Time in STT-MRAM

In general, the retention time in non-volatile memories refers to the data retaining capability of their bit-cells regardless of their powered-on or powered-off conditions. In STT-MRAM, this retention time depends on the thermal stability factor (∆) of the MTJ cell. The higher the ∆ value, the longer the data can stay in the bit-cell. For a value of 60, the cell can retain its content for 10 years [14]. The ∆ value can be modeled as:

$$\Delta = \frac{V \cdot H_k \cdot M_s}{2 \cdot K_B \cdot T} \qquad (1)$$

where V is the volume of the free layer, Ms is the saturation magnetization, KB is the Boltzmann constant, T is the temperature in kelvin and Hk is the effective anisotropy field. As shown in Equation (1), the ∆ value is directly proportional to the volume of the MTJ cell, which can be tuned by changing the MTJ size during fabrication. Another way is to adjust the Ms and Hk values at the material level during the stack development.

An MTJ cell with a high ∆ value requires high switching latency and energy. This is because the thermal barrier is higher for a high ∆ value, so more current is required for a longer duration to perform switching. The relation between ∆ and Ic can be modeled using the following equation [15]:

$$I_c = \frac{4 \cdot e \cdot K_B \cdot T}{h} \cdot \frac{\alpha}{\eta} \cdot \Delta \cdot \left(1 + \frac{4 \cdot \pi \cdot M_{eff}}{2 \cdot H_k}\right) \qquad (2)$$

where e is the elementary charge, h is Planck's constant, α is the LLG damping constant, η is the STT-MRAM efficiency parameter and 4πMeff is the effective demagnetization field. The impact of the ∆ value on switching latency and energy for a single bit-cell is illustrated in Fig. 2 (a). Since a high retention time is not required for cache applications, this trade-off is usually exploited to deliver a high-performance and low-energy STT-MRAM architecture.

Fig. 2: Effects of thermal stability factor on (a) write energy and latency, (b) failure rates (for setup see Sec. IV-A). Panel (a) plots write energy [J] and write latency [ns], and panel (b) plots retention failure and read disturb probability, over the thermal stability factor (∆) from 20 to 60.

C. Failure Rate Dependency on Thermal Stability Factor

With a low ∆ value, the STT-MRAM access latency and energy can be significantly reduced. However, a low ∆ increases the retention failure rate and the possibility of read disturb, which are explained next.

Retention Failure: Retention failures in STT-MRAM happen due to the inherent thermal fluctuation of the MTJ cell, which can switch its magnetic orientation. This can occur regardless of whether a memory access is performed or not. The retention failure probability (PRF) for a given time period (t) can be computed according to [14] as:

$$P_{RF} = 1 - \exp\left[-\frac{t}{\tau \cdot e^{\Delta}}\right] \qquad (3)$$

where τ is a constant equal to 1 ns. According to this equation, a relatively high ∆ can significantly improve the reliability.

Read Disturb: Since both read and write currents share the same path in the STT-MRAM bit-cell, the read current can accidentally switch the bit-cell content during the read operation time (tr). This phenomenon is known as read disturb. Its probability (PRD) is strongly dependent on ∆ according to [16]:

$$P_{RD} = 1 - \exp\left[-\frac{t_r}{\tau \cdot e^{\Delta \left(1 - \frac{I_r}{I_c}\right)}}\right] \qquad (4)$$

where Ir denotes the read current. In this equation, ∆ has an exponential influence on the probability of read disturb. In other words, a small reduction in ∆ increases the read disturb probability significantly.

Write Error: Write errors in STT-MRAM typically occur due to the stochastic switching nature of the bit-cells. However, a smaller ∆ value makes the switching of the MTJ cell faster due to the reduction in Ic. Hence, the required switching time (tw) of the MTJ reduces for the same target Write Error Rate (WER), which is apparent from the following equation [17]:

$$WER_{bit}(t_w) = 1 - \exp\left[\frac{-\pi^2 \cdot (I - 1) \cdot \Delta}{4 \left(I \cdot e^{C (I - 1) t_w} - 1\right)}\right], \qquad I = \frac{I_w}{I_c} \qquad (5)$$

Here C is a technology dependent parameter and I is the ratio of the write current (Iw) to the critical current (Ic).

Fig. 2 shows the effects of relaxing the thermal stability factor on the error rate along with the MTJ switching time. A low ∆ clearly means that the cell is more likely to suffer from errors at read access or during the retention time. Therefore, in this work, we trade off the retention failure and read disturb rates against the energy and performance improvements obtained by reducing ∆. On the other hand, unlike for retention failure and read disturb, a smaller ∆ value has a positive impact on the write access. Hence, we quantify this impact as the improvement in write latency and energy for a fixed WER.
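To make the exponential dependence on ∆ in Equations (3) and (4) concrete, the following is a minimal Python sketch that evaluates both failure probabilities for a single bit-cell. Only τ = 1 ns is taken from the text; the function names, the 1 ms retention window, the 1 ns read time and the read-to-critical current ratio of 0.3 are hypothetical values chosen purely for illustration, not parameters from the paper.

```python
import math

TAU_S = 1e-9  # tau = 1 ns, as stated in the text


def retention_failure_probability(delta, t_seconds, tau=TAU_S):
    """Eq. (3): P_RF = 1 - exp(-t / (tau * e^delta))."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-t_seconds / (tau * math.exp(delta)))


def read_disturb_probability(delta, t_read, i_read_over_i_c, tau=TAU_S):
    """Eq. (4): P_RD = 1 - exp(-t_r / (tau * e^(delta * (1 - I_r/I_c))))."""
    effective_barrier = delta * (1.0 - i_read_over_i_c)
    return -math.expm1(-t_read / (tau * math.exp(effective_barrier)))


# Hypothetical operating point (assumption, not from the paper):
# 1 ms idle window, 1 ns read pulse, read current at 30% of I_c.
for delta in (20, 32, 40, 60):
    p_rf = retention_failure_probability(delta, t_seconds=1e-3)
    p_rd = read_disturb_probability(delta, t_read=1e-9, i_read_over_i_c=0.3)
    print(f"delta={delta:>2}  P_RF={p_rf:.3e}  P_RD={p_rd:.3e}")
```

Both probabilities drop by many orders of magnitude as ∆ grows from 20 to 60, which is the trade-off the rest of the paper explores; the absolute numbers depend entirely on the assumed operating point.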
D. Approximate Computing

The Approximate Computing (AC) concept has recently emerged as a promising approach which relies on relaxing the bounds of precise computing to improve energy and performance efficiency by orders of magnitude. This is done by leveraging the application's resilience to errors and producing an acceptable output quality within tolerable ranges instead of an error-free output. Furthermore, many applications which are data intensive, and hence exhaust on-chip memory, provide opportunities to exploit approximate computing very efficiently. Examples are image, audio and video processing applications, where a huge gap exists between the level of accuracy required by the perceptual limitations of humans and that provided by the system. Thus, for such applications, AC has the potential to significantly optimize the behavior of the on-chip memory in terms of performance and energy. In this work, we employ this approximate computing scheme in memories at architecture-level and perform the evaluation using image processing applications.

E. Related Work

For high-performance and energy-efficient STT-MRAM implementation, prior research efforts have explored several techniques to achieve energy efficiency at various levels of STT-MRAM design. At device-level, the extreme write margin can be addressed by reducing the thermal stability factor [4]. However, this method dramatically affects the retention time of the memory and can increase the overall failure rate of STT-MRAM. Several circuit-level techniques [9, 10, 18, 19] have been proposed to detect the actual switching time of an STT-MRAM bit cell and stop the unnecessary current flow immediately after the write completion. Although such techniques significantly reduce the write energy, the overall write latency remains the same. At the architecture level, the write margin is reduced significantly by exploiting incomplete write operations [20, 21]. The unfinished bits later have to be processed by robust Error Correction Codes (ECCs). However, such coding schemes have relatively large decoding latencies that can impair the memory read latency. All the above efforts have focused on addressing the inefficient write operation in STT-MRAM while preserving the accuracy of read/write operations.

Regarding approximate computing, previous work on approximate memory focuses on finding a trade-off between errors in the content of the saved or read data and the energy consumption. In prior work on DRAM [12, 13], the energy consumption is optimized by relaxing the refresh rate, which reduces the refresh power. However, the retention failure rate increases as a cost of this method. In [22], the energy reduction is achieved by reducing the number of pulses used to write Phase Change Memories. For STT-MRAM, prior efforts are very limited. In [23], approximate computing is leveraged for STT-MRAM design improvement, in which the MTJ cell is accessed with different levels of accuracy (i.e., different current levels) based on the application requirements. Hence, the approximation is adopted in order to reduce the dynamic energy of read and write operations by reducing both sensing and switching currents at the cost of higher read and write error rates. The main limitation of that work is that the effect of process variation is totally ignored. Process variation in STT-MRAM affects device parameters of both the MTJ cell and the CMOS transistor, which in turn impacts the main parameters of read and write operations (i.e., the switching current, the thermal stability factor and the output current of the transistor). Consequently, with different and tiny read-write access current levels, process variation results in considerable variations in the output accuracy at architecture-level. In our work, we explore a new approximation for STT-MRAM design to achieve substantial improvements in both energy consumption and performance for a negligible loss in application quality by lowering the thermal stability factor of the MTJ cell based on the application requirements. A preliminary version of our work was published in [24], which provides some insight into the impact of our idea on the performance, energy consumption and reliability of the STT-MRAM design.

III. APPROXIMATE COMPUTING IN STT-MRAM

This section presents our proposed approximate computing methodology for STT-MRAM. First, the overview is explained, followed by a detailed description of each abstraction component of our proposed methodology to trade off accuracy versus performance and energy. In the end, we explain how the parts of the data which are critical and must remain reliable have to be protected in our framework.

A. Overview

Since on-chip memories (e.g., L1 cache) often do not require long data retention periods due to frequent accesses, the thermal stability factor of the bit-cells in those memories can be reduced, which lowers the data retention time. As a result, the write latency will be reduced, and the amount of current required to switch the bit-cell content will be smaller. Consequently, the overall write energy will become significantly better. However, the drawback of this approach is an exponential increase in the retention failure and read disturb rates. Therefore, to analyze this trade-off, we have developed a cross-layer framework (as demonstrated in Fig. 3), in which we carry the impact of the ∆ reduction forward to the memory architecture and application level. We have calculated the impact of various ∆ values on failure rates as well as write energy and latencies at the device level (similar to Fig. 2). These values are projected to the architecture level as per-access energy and delay as well as failure rates. For this, we have built a fault injection model at the memory architecture level to accurately mimic the possibility of read and retention failures. At system level, the application output is evaluated using the Signal to Noise Ratio (SNR) metric. This way we can find the optimal value of ∆ for the best trade-off between acceptable output quality (correctness) and performance and energy improvement. In the context of approximate computing, we allow erroneous (i.e. inaccurate) data to be fetched by the processor. Hence, leveraging approximate computing can improve the write energy and latency of STT-MRAM. Also, as the write latency of STT-MRAM is reduced, the overall system energy efficiency can be further improved due to shorter execution times.

Fig. 3: Illustration of cross layer approach for approximate computing in STT-MRAM. (Device level: the ∆ value, a device parameter dependent on MTJ size, determines failure rates, switching latency and switching energy. Architecture level: fault injection, cache latencies in cycles, cache energy per access. Application level: SNR analysis, performance analysis, energy analysis.)

Overall, in this work, we analyze the range of parameters from STT-MRAM technology, system architecture and
application requirements to achieve an acceptable level of quality while maximizing energy and performance gains. For that purpose, we employ image processing applications; in particular, as a case study, we deal with the JPEG format. This is a lossy compression method for digital images to store or transmit data in an efficient form without losing the ability to re-extract an acceptable version of the image. For this reason, the JPEG format provides some inherent level of fault tolerance, which means that it fits very well to the idea of approximate computing.

B. Adaptation of Thermal Stability Factor

As discussed in Sec. II-C, the thermal stability factor has a strong influence on the reliability of the STT-MRAM cell at both idle and access time, and lowering it reduces Ic. Consequently, various critical optimizations for STT-MRAM design can be performed, namely i) a performance optimization, which is obtained from the reduction of the time required to switch the MTJ resistance state (i.e., write latency tw), ii) a power optimization: the write access power of STT-MRAM depends on Ic as Pw = Ic² · RMTJ, so for a fixed resistance state of the MTJ cell, a low Ic value leads to a reduction in the power of the STT-MRAM cell, and iii) an energy optimization: the energy consumption (Ew = Pw · tw) can also be dramatically reduced, because both tw and Pw become small for a low ∆ value. Fig. 2 shows the effects of ∆ reduction on failure probabilities, write latency and write energy. According to this figure, as the value of ∆ decreases, both the write access latency and write energy decrease at the cost of a higher retention failure rate. In our framework, for the ∆ reduction from 60 to 20, the improvements of the write latency and energy per access reach up to 25% and 70%, respectively. In contrast, the retention failure probability increases significantly and becomes unacceptable for ∆ values below 32 (see Fig. 2).

C. Fault Injection Approach

Low ∆ values for the STT-MRAM memory are associated with retention failures and read disturb failures. Therefore, an accurate fault injection model has to be built based on the combined failure rates, i.e., Failures In Time (FIT). The combined FIT rate has to be converted into an access-based failure probability. The accesses to memory (e.g., L1 data cache) are performed by read and write requests. However, we inject faults only at read accesses, since they are more frequent than write accesses. We compute the Fault Injection Rate (FIR) per read access based on the FIT value of the adopted ∆ and the Read Rate (RR) as follows: FIR = FIT/RR. Table I provides the overall mean time to failure (MTTF), the reciprocal measure of the combined failure rate, for ∆ values from 20 to 60.

TABLE I: The overall mean time to failure (MTTF) for various ∆ values (for setup see Sec. IV-A)

∆     MTTF [s]          ∆     MTTF [s]
20    9.22 × 10^-7      34    1.11
25    1.37 × 10^-4      35    3.01
31    5.52 × 10^-2      40    447.59
32    1.50 × 10^-1      45    6.7 × 10^4
33    4.08 × 10^-1      60    2.17 × 10^11

D. Tolerating an Acceptable Error Rate Using Approximate Computing

To define the accepted value of the lowered ∆ for cache memories, we have to estimate the acceptable quality of the extracted image by observing the influence of the ∆ value on the error rate, which is translated into read-access-based injected failures. The metric is based on the computed Signal to Noise Ratio (SNR) between the approximated image (faulty image) and the golden image (fault-free image). In other words, SNR serves as a quality measure for the approximated image. It is calculated from the variance of the signal (i.e., the golden image) (σS²) and the variance of the noise (i.e., the faulty image) (σN²) as in Equation (6), and is expressed in decibels (dB).

$$SNR = 20 \log_{10} \frac{\sqrt{\sigma_S^2}}{\sqrt{\sigma_N^2}} \qquad (6)$$

The value of SNR directly impacts the performance and energy consumption of the proposed scheme. However, the minimum acceptable SNR value depends mainly on the application requirements. For instance, in image processing applications, the minimum required SNR value of the faulty image, the SNR threshold (SNRth), is set at the level where the human brain and eyes can differentiate between a faulty image and a golden one. This means that if the SNR value of the output is less than SNRth, the output quality is considered unacceptable. See Fig. 4, where the SNR for an acceptable JPEG image quality has to be more than SNRth = 50. However, for more performance and energy improvements, smaller SNR values can be considered at the cost of a lower but still acceptable output quality. The corresponding pseudo-codes for the generation of golden and faulty images are presented in Algorithm 1 and Algorithm 2.

Algorithm 1 Golden image generation
Input: Error-free output constraint {∆max}
Output: Error-free image
  Set the corresponding STT-MRAM based data cache configurations
  Execute the simulation without fault injection mode
  return Error-free image

Algorithm 2 Faulty image generation
Input: Approximate computing constraints {∆min, ∆max, SNRth} and the maximum number of experiments (N)
Output: Acceptable and non-acceptable percentage of image quality
  for ∆ from ∆min to ∆max do
    Set the corresponding STT-MRAM based data cache configurations
    Calculate the fault injection rate per read access (FIR)
    for n from 1 to N do
      Execute the simulation with fault injection mode
      Calculate SNR
      if SNR > SNRth then
        Increase the acceptable quality percentage
      else
        Increase the non-acceptable quality percentage
      end if
    end for
  end for
  return Acceptable and non-acceptable percentage of image quality

E. Protection of Critical Data

A major challenge of using approximate storage is that most applications which are highly amenable to the approximate computing paradigm have a mixture of control data, the so-called critical data, which cannot tolerate any errors, and data that may be approximated or unreliable. The workloads in our case study are image processing applications (JPEG), which use a lossy compression technique to produce a similar coloured image with reduced size and an acceptable quality. Here, any imprecision in the region that stores the meta-data (the header of the image) totally corrupts the output image. This means that the header data controls the image data and should not be approximated. In contrast, the compressed image data (i.e., quantization) is tolerant to imprecision, and any errors in it may only lead to some degree
of quality loss in the output image. Therefore, the stored data has to be classified into reliable and unreliable data, which are header and quantization, respectively. There are various solutions to protect the critical part of the data.

The critical data size is very small compared to that of the non-critical data. One way to make critical data error resilient is to design a heterogeneous memory array for the data cache which has high and low ∆ values. In this design, the critical data has to be stored in the cells with high ∆ to guarantee error-free operation. However, this requires changes to the fabrication parameters of the two arrays as well as a complex cache controller to allocate data to the different arrays. A less complicated way to protect the critical data is either to use multiple copies of the content of this data, such as dual or triple modular redundancy, or to use an Error Correction Code (ECC), which protects the data by adding check bits. Since the size of the critical data is significantly smaller than that of the approximate-able data, the overhead due to the extra bit-cells, either for the repeated bits of dual or triple modular redundancy or for the check bits of the ECC approach, is minimal.

Fig. 4: Images of various ∆ and SNR values for STT-MRAM data cache for “jpegtran” workload: (a) golden image; (b) acceptable image quality, ∆ = 32 and SNR = 50; (c) non-acceptable image quality, ∆ = 26 and SNR = 23; (d) non-acceptable image quality, ∆ = 27 and SNR = 19

Fig. 5: Acceptable image quality for various ∆ values (image quality in % over thermal stability factor ∆ for djpeg, jpegtran and cjpeg) at (a) 1 GHz clock frequency and (b) 2 GHz clock frequency

IV. SIMULATION RESULTS

In this section, we present the experimental analysis to show the benefits of using STT-MRAM in approximate computing by using image processing applications (JPEG) as a case study.

A. Simulation Setup

For the bit-cell characterization, we extracted the read and write latencies for STT-MRAM by SPICE simulation using TSMC 65nm general purpose transistor models and the perpendicular STT-MRAM model presented in [25]. For the bit-cell in the L1 data cache, we used ∆ values from 10 up to 70 in order to determine the minimum value that can be adopted for optimal results. For our evaluations, we employed the gem5 simulator [26]. gem5 is a full-system, cycle-accurate performance simulator that supports all levels of the memory hierarchy with various configurations such as capacity, associativity, latency, and block size. We extended the simulator with a fault injection framework in order to support the retention time of each particular value of ∆ adopted for an STT-MRAM based data cache. In contrast, the instruction cache has to be fault-free. This is why we assumed ∆ = 60 for the instruction cache to guarantee a high retention time. This assumption is reasonable, as the number of write accesses to the instruction cache is considerably lower. Therefore, the performance and energy overheads of adopting a higher ∆ for this cache would be negligible. Table II summarizes the set-up for our experiments. The evaluations are performed by running three image processing applications from the MiBench suite 100 times each, to account for the non-determinism of our fault injection model. Each output of each experiment has a different level of quality loss according to the error rate of the used ∆. We use SNR as the metric to evaluate the quality of the application output.

TABLE II: Simulation setup in gem5

gem5 configuration
  ISA                        x86
  Processor                  Single-core, 0.5/1/2 GHz, out-of-order, 4-issue
  L1 data-cache              64 KB, 2-way set associative, 64 B line size; STT-MRAM, different write latencies and ∆ values
  L1 instruction-cache       32 KB, 2-way set associative, 64 B line size; STT-MRAM, ∆ = 60
MiBench applications [27]    djpeg, cjpeg, jpegtran

B. SNR Evaluations

The SNR measurements reveal the degree to which an application can produce satisfactory and reasonable output when the relaxed thermal stability factor of STT-MRAM is used in the data cache for approximate applications. This determines the achievable limits of the performance and energy improvements for STT-MRAM technology. Our analysis depends on extracting 100 different values of SNR for each ∆ value up to 70 for all three image applications. Fig. 4 shows acceptable and non-acceptable faulty image quality along with the original one. After observing the images along with the related SNR, we can define the values of SNRth, which are 50, 50 and 60 for djpeg, jpegtran and cjpeg, respectively. This means that the faulty image can be considered acceptable if its SNR value is greater than SNRth (as explained in Algorithm 2).
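As a rough illustration of the SNR-based acceptance check described in this subsection (Equation (6) and the inner loop of Algorithm 2), the following Python sketch computes the SNR of a faulty image against a golden image and the percentage of runs that exceed a threshold. It is a sketch under stated assumptions, not the authors' tooling: images are represented as flat lists of pixel values, the noise is assumed to be the pixel-wise difference between the faulty and the golden image, and the function names `snr_db` and `acceptable_percentage` are hypothetical.

```python
import math
from statistics import pvariance


def snr_db(golden, faulty):
    """Eq. (6): SNR = 20 * log10(sqrt(var_signal) / sqrt(var_noise)), in dB.

    Assumption: noise is the pixel-wise difference faulty - golden.
    """
    noise = [f - g for g, f in zip(golden, faulty)]
    var_signal = pvariance(golden)
    var_noise = pvariance(noise)
    if var_noise == 0:
        return math.inf  # no variation in the noise: treat as a perfect match
    return 20 * math.log10(math.sqrt(var_signal) / math.sqrt(var_noise))


def acceptable_percentage(snr_values, snr_threshold):
    """Share of experiments (in %) whose SNR is strictly above the threshold."""
    accepted = sum(1 for s in snr_values if s > snr_threshold)
    return 100.0 * accepted / len(snr_values)


# Toy data (hypothetical): a 4-pixel "image" and one faulty copy.
golden = [0, 10, 20, 30]
faulty = [1, 9, 21, 29]
print(snr_db(golden, faulty))

# Toy SNR results of four runs, checked against a threshold of 50.
print(acceptable_percentage([60, 55, 40, 50], snr_threshold=50))
```

In a full experiment, `snr_db` would be evaluated once per simulation run and `acceptable_percentage` once per ∆ value, with the per-application thresholds given in the text.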
Fig. 6: Relation between image quality, performance (runtime) and write energy for an STT-MRAM based data cache at (a) 1 GHz clock frequency and (b) 2 GHz clock frequency (relative changes in % over thermal stability factor ∆, with the approximate computing region marked)

According to the SNR measurements, we count the acceptable images among the 100 faulty images generated for each ∆ value, for all the image applications (see Fig. 5). This figure shows that for ∆ ≤ 25 the percentage of acceptable images is zero for all three image applications, whereas for ∆ ≥ 40 all the images are acceptable. For ∆ values of 32, 33, and 34, our results show that more than 95% of the images have acceptable quality. Therefore, these values of ∆ can be considered for approximate computing, in which the output quality loss is less than 5%. Please note that in our framework the retention failure rate dominates the read disturb rate in the SNR measurements (see Fig. 2 (b)). Applications with faster read requirements need higher read currents, and consequently the read disturb rate will be higher. In such cases, the SNR_th evaluation criterion can be changed, which alters the percentage of acceptable images accordingly.

C. Performance and Energy Analysis of Optimal ∆ for Approximate Computing

The performance impact of a lower ∆ value for the STT-MRAM data cache is closely tied to the processor frequency. On one hand, for a given frequency, the write latency is reduced by 0.5 ns when ∆ is lowered to 32, as illustrated in Fig. 2 (a). On the other hand, the performance gain also depends on the clock frequency and is larger at a lower processor clock frequency. For instance, it reaches 25% and 10% at clock frequencies of 1 GHz and 2 GHz, respectively (see Fig. 6). Furthermore, according to our experiments, 66% of the total energy of the STT-MRAM L1 cache is consumed by write operations in the data cache. Therefore, any improvement in the energy consumption of the STT-MRAM cells in the data cache improves the overall system energy efficiency. Results show that the energy consumption per access can be reduced by up to 70% with a ∆ value of 32.

Based on the performance and energy improvements of a low ∆ value, along with the output quality, we can define the optimal ∆ value for a particular system setup. Fig. 6 illustrates this trilateral relationship (i.e., energy, performance, quality). According to this figure, the optimal configuration for approximate computing applications can be obtained by relaxing the thermal stability factor from the usually applied 60 down to 32, where the overall system energy reduction reaches around 46% and the performance gains are 25% and 10% at clock frequencies of 1 GHz and 2 GHz, respectively.

V. CONCLUSIONS

We have developed a cross-layer framework, from technology parameters to architecture and application levels, to evaluate the applicability of STT-MRAM technology for approximate computing. We have also provided a methodology to determine the optimal thermal stability factor in order to find the right balance between acceptable accuracy and energy and performance gains. Our results show that an energy reduction of 46% along with a performance gain of up to 25% can be achieved at system level with a reasonable output quality.

VI. ACKNOWLEDGEMENT

This work was partly supported by the European Commission under the Horizon-2020 Program as part of the GREAT project (http://www.great-research.eu/) and by ANR/DFG as part of the MASTA project.

REFERENCES

[1] N. Kim, et al., “Leakage current: Moore’s law meets static power,” Computer, pp. 68–75, 2003.
[2] International Technology Roadmap for Semiconductors, http://www.itrs.net, 2013.
[3] K. Wang, et al., “Low-power non-volatile spintronic memory: STT-RAM and beyond,” Journal: Applied Physics, 2013.
[4] Driskill-Smith, et al., “Latest advances and roadmap for in-plane and perpendicular STT-RAM,” in International Memory Workshop, 2011, pp. 1–3.
[5] S. Fujita, et al., “Technology Trends and Near-Future Applications of Embedded STT-MRAM,” in 2015 IEEE International Memory Workshop, May 2015, pp. 1–5.
[6] Z. Sun, et al., “Multi Retention Level STT-RAM Cache Designs with a Dynamic Refresh Scheme,” in Micro, 2011, pp. 329–338.
[7] A. Ahari, et al., “Improving Reliability, Performance, and Energy Efficiency of STT-MRAM with Dynamic Write Latency,” in ICCD, Oct. 2015, pp. 109–116.
[8] R. Venkatasan, et al., “Energy-Efficient All-Spin Cache Hierarchy Using Shift-Based Writes and Multilevel Storage,” ACM JETC, pp. 4:1–4:24, 2015.
[9] R. Bishnoi, et al., “Avoiding Unnecessary Write Operations in STT-MRAM for Low Power Implementation,” in ISQED, 2014, pp. 548–553.
[10] P. Zhou, et al., “Energy Reduction for STT-RAM Using Early Write Termination,” pp. 264–268, 2009.
[11] H. Naeimi, et al., “STT-RAM scaling and retention failure,” Intel Technology Journal, pp. 54–75, 2013.
[12] J. Lucas, et al., “Sparkk: Quality-scalable approximate storage in DRAM,” in The Memory Forum, 2014, pp. 1–9.
[13] S. Liu, et al., “Flikker: saving DRAM refresh-power through critical data partitioning,” ACM SIGPLAN Notices, pp. 213–224, 2012.
[14] C. W. Smullen, et al., “Relaxing non-volatility for fast and energy-efficient STT-RAM caches,” in 2011 IEEE 17th HPCA, 2011, pp. 50–61.
[15] Y. Jin, et al., “Area, Power, and Latency Considerations of STT-MRAM to Substitute for Main Memory,” in Proc. ISCA, 2014.
[16] D. Apalkov, et al., “Spin-transfer torque magnetic random access memory (STT-MRAM),” JETC, p. 13, 2013.
[17] K. Munira, et al., “A Quasi-analytical model for energy-delay-reliability tradeoff studies during write operations in STT-RAM,” Electron Devices, 2012.
[18] T. Zheng, et al., “Variable-energy write STT-RAM architecture with bit-wise write-completion monitoring,” in ISLPED, 2013, pp. 229–234.
[19] D. Suzuki, et al., “Cost-Efficient Self-Terminated Write Driver for Spin-Transfer-Torque RAM and Logic,” Magnetics, pp. 1–4, 2014.
[20] Y. Emre, et al., “Enhancing the reliability of STT-RAM through circuit and system level techniques,” in 2012 Workshop on SiPS, 2012, pp. 125–130.
[21] X. Bi, et al., “Probabilistic design methodology to improve run-time stability and performance of STT-RAM caches,” in ICCAD, 2012, pp. 88–94.
[22] A. Sampson, et al., “Approximate storage in solid-state memories,” TOCS, p. 9, 2014.
[23] A. Ranjan, et al., “Approximate storage for energy efficient spintronic memories,” in 2015 52nd DAC, 2015, pp. 1–6.
[24] F. Oboril, et al., “Fault tolerant approximate computing using emerging non-volatile spintronic memories,” in VTS, 2016, pp. 1–1.
[25] A. Mejdoubi, et al., “A compact model of precessional spin-transfer switching for MTJ with a perpendicular polarizer,” in MIEL, 2012, pp. 225–228.
[26] N. Binkert, et al., “The gem5 simulator,” ACM SIGARCH Computer Architecture News, pp. 1–7, 2011.
[27] M. R. Guthaus, et al., “MiBench: A free, commercially representative embedded benchmark suite,” in WWC-4. 2001 International Workshop on, pp. 3–14.
