
IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 61, NO. 4, AUGUST 2014 1747

Multiple Cell Upset Classification in Commercial SRAMs
G. Tsiligiannis, L. Dilillo, A. Bosio, P. Girard, S. Pravossoudovitch, A. Todri, A. Virazel, H. Puchner, C. Frost,
F. Wrobel, and F. Saigné

Abstract—While single bit upsets on memories and storage elements are mitigated through redundancy and/or error correction codes, Multiple-Cell-Upsets (MCU) may become a significant threat to the integrity of systems when the corrupted cells belong to the same word. In this paper, we identify four types of MCUs as they were recorded during several irradiations under an atmospheric-like neutron beam (ISIS facility). An analysis is done on the underlying reasons of occurrence of each MCU type, as well as their shapes and sizes in order to classify them. The results of this work concern a commercial 90 nm SRAM that was tested under an atmospheric neutron beam in static and dynamic mode. It is shown that, when the memory is in dynamic mode, not only may the typical MCUs that involve a few flipped cells appear, but large clusters of upsets may also occur, with hundreds or even thousands of cells being affected.

Index Terms—Dynamic mode, multiple cell upsets (MCUs), neutron irradiation, single event latchup (SEL), SRAM.

I. INTRODUCTION
THE EFFECTS OF neutron induced upsets to Integrated Circuits (ICs) have been extensively studied in the last two decades. Until recently, a major source of errors induced by neutrons has been considered to be Single Event Upsets (SEUs), where the value of the bit stored in a single cell of the memory is reversed. However, with the downscaling of devices, Multiple Cell Upsets (MCU) have started to appear more often, significantly affecting IC robustness. MCUs are upsets induced by a single impinging particle, in which cells neighboring the struck cell are also upset. This happens because of the reduced distances between the sensitive nodes of cells (as for example the source of the NMOS transistor of the inverted loop in the SRAM memories), while at the same time the charge induced by the impinging particles remains relatively stable. The importance of MCUs comes from the possibility of turning them into Multiple Bit Upsets (MBU). MBUs are a particular case of MCUs in which more than one bit flip occurs within a word. In this case, even the application of error detection and Error Correction Codes (ECC) may not be sufficient to preserve the data integrity. In most of the memories that embed ECC techniques, the latter generally correct no more than one bit flip per word, although they are able to detect two or three bit flips. The addition of the capability to correct a second bit flip within a word would imply a large number of redundant bits.
Many studies investigating particle induced MCUs exist in the literature [1]–[10]. Most of them are focused on MCUs appearing in SRAMs at both the simulation and the experimental level. The failure mechanism of MCUs is described as follows: secondary charged particles generated by a particle strike generate electron-hole pairs, inducing free charge in silicon. This charge drifts and diffuses towards a sensitive region of the cell, generating a parasitic current and resulting in an upset (SEU). Depending on the nature of the ions and the Linear Energy Transfer (LET), the parasitic generated charge may affect a group of neighboring cells, resulting in an MCU. For example, at the simulation level, [1] shows that the proton induced MCU cross section is related to the downscaling of the sensitive nodes (cells) among other factors. In [2] another study is performed in which the SEU and MCU cross sections are calculated with respect to the deposited charge in the SRAM cells. Reference [3] explains the role of the triple-well in the MCU frequency increase, due to the amplification of the collected charge.
Besides the work done at the simulation level, several studies have confirmed the existence of MCU at the experimental level [4]–[10]. An extensive work where the MCU shapes and sizes are analyzed according to the layout architecture is presented in [4]. The reported MCUs were observed by irradiating an SRAM with different neutron energies while the memory was in retention (static) mode. In [5], micro-Single Event Latchups (SEL) have been recorded in the form of big clusters of upsets. This is due to the confinement of an occurring latchup to small regions of the memory delimited by well taps. In [6] MCUs were observed in the form of big clusters of upsets during irradiation and read-back of a 90 nm SRAM with neutrons, while in [7] large blocks of upsets have been observed for SRAMs and DRAMs. Finally, in [8] the effect of the device orientation is explored with respect to the MBU cross sections.

Manuscript received September 30, 2013; revised December 24, 2013; accepted March 21, 2014. Date of publication May 23, 2014; date of current version August 14, 2014. This work was supported by the French “Agence Nationale pour la Recherche” (ANR) under the framework of the HAMLET project ANR-09-BLAN-0155-01.
G. Tsiligiannis, L. Dilillo, A. Bosio, P. Girard, S. Pravossoudovitch, A. Todri, and A. Virazel are with the Laboratoire d’Informatique, de Robotique et de Microelectronique de Montpellier (LIRMM) Universite de Montpellier II/CNRS, 34095 Montpellier Cedex 5, France (e-mail: tsiligiann@lirmm.fr; dilillo@lirmm.fr; bosio@lirmm.fr; girard@lirmm.fr; pravo@lirmm.fr; todri@lirmm.fr; virazel@lirmm.fr).
H. Puchner is with the Cypress Semiconductor, Technology R&D, San Jose, CA 95134 USA (e-mail: hrp@cypress.com).
C. Frost is with the Rutherford Appleton Laboratory, Harwell Oxford Didcot, OX11 0QX, U.K. (e-mail: christopher.frost@stfc.ac.uk).
F. Wrobel and F. Saigné are with the Institut d’Electronique du Sud, Universite Montpellier II / CNRS, UMR-CNRS 5214, 34095 Montpellier Cedex 5, France (e-mail: frederic.saigne@ies.univ-montp2.fr).
F. Wrobel is with the Institut d’Electronique du Sud, Universite Montpellier II / CNRS, UMR-CNRS 5214, 34095 Montpellier Cedex 5, France and also with Institut Universitaire de France (e-mail: frederic.wrobel@ies.univ-montp2.fr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNS.2014.2313742

0018-9499 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 2. Scheme of the March C- test algorithm. The operations of each element are enclosed in parentheses (1st element with a w0 operation, 2nd element with r0w1 operations, etc.), while the arrow indicates the direction of execution of each element through the address space. March C- is composed of six elements and 10 operations (10N complexity).
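The element-by-element execution summarized in this caption can be sketched in a few lines of Python. This is only an illustration, assuming the commonly published form of March C- (six elements, ten operations per cell); the names `MARCH_C_MINUS`, `run_march` and the `inject` hook are hypothetical and do not describe the FPGA state machine used in the experiments.

```python
# March C- in its commonly published form: six elements, ten operations per
# cell (10N). +1 = ascending addresses, -1 = descending. The first and last
# elements may run in either direction; ascending is an arbitrary choice here.
MARCH_C_MINUS = [
    (+1, ["w0"]),
    (+1, ["r0", "w1"]),
    (+1, ["r1", "w0"]),
    (-1, ["r0", "w1"]),
    (-1, ["r1", "w0"]),
    (+1, ["r0"]),
]


def run_march(memory, elements=MARCH_C_MINUS, inject=None):
    """Apply a March test to `memory`, a mutable list of 0/1 cells.

    `inject` is a hypothetical hook, called as inject(element_index, memory)
    before each element, used here only to emulate an upset.
    Returns (element_index, operation_index, address, expected, observed)
    for every read that does not return the expected value.
    """
    mismatches = []
    n = len(memory)
    for e_idx, (direction, ops) in enumerate(elements):
        if inject is not None:
            inject(e_idx, memory)
        addresses = range(n) if direction > 0 else range(n - 1, -1, -1)
        for addr in addresses:
            # All operations of an element hit one cell before moving on.
            for o_idx, op in enumerate(ops):
                value = int(op[1])
                if op[0] == "w":
                    memory[addr] = value
                elif memory[addr] != value:
                    mismatches.append((e_idx, o_idx, addr, value, memory[addr]))
    return mismatches


def flip_cell_3(element_index, memory):
    # Emulated upset: flip cell 3 just before the third element (index 2).
    if element_index == 2:
        memory[3] ^= 1


faults = run_march([1] * 8, inject=flip_cell_3)
print(faults)
```

In this toy run the flipped cell is sensed by the read of the third element and then overwritten by the write that follows it, so it is reported only once.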

TABLE I
APPLIED TEST DATA

Fig. 1. Experimental setup. A commercial 32 Mbit SRAM memory is connected to an FPGA via ribbon cables long enough such that the FPGA stays outside the beam line. The power supplies feeding the memory and the FPGA, as well as the computer receiving the data, are outside the beam room.

In this work, we identify the different categories of MCUs that may appear in a memory, on the basis of experimental results that we obtained through several atmospheric neutron irradiation campaigns. This identification reveals the principal causes of each type of MCU occurrence, and it can be used for mitigating such phenomena in future architectures. Although similar events have been reported in the past for the same or different types of memories, such as address decoder failure reported in [9] and [10], some of the presented categories have never been reported before to the best of our knowledge. We categorized the MCUs according to their shapes, sizes and source of occurrence. The experiments have been performed with an atmospheric-like neutron beam at the ISIS facilities in the UK [11]. Besides static mode, SRAM memories were also tested in dynamic mode during the irradiation run and, as will be shown, this is the main factor that allowed the observation of big clusters of upsets.

II. EXPERIMENTAL SETUP AND CONDITIONS
The memory that we used during the experiments is a commercial 8-bit word, 32 Mbit 90 nm bulk SRAM with 3.3 V nominal power supply voltage. A Finite State Machine (FSM), implemented through a Field Programmable Gate Array (FPGA), was responsible for the execution of the different tests applied to the memory. The FPGA was operating at 50 MHz, which corresponds to the memory operation just below 15 MHz, close to the maximum nominal speed. A computer located outside the beam room was connected to the FPGA in order to collect the data, while the power supplies (feeding the FPGA and the memory) were also placed outside the beam room, to allow power cycling in the case of a radiation induced latchup. In certain experimental setups the temperature of the SRAM was elevated using a thin foil heater and a thermocouple, both controlled by a temperature controller instrument. The basic experimental setup is depicted in Fig. 1.
The devices were irradiated in both the static and dynamic modes and under different operating conditions. In dynamic mode testing, different March algorithms such as “Dynamic Stress” [12], “March C-” [13], “Mats+” [14] have been used as already described in [12], while the static mode testing was performed using a logical checkerboard data background (“10101010”). March algorithms are a combination of read and write operations that are applied to the entire address space of the memory and they are commonly used for manufacturing testing. A March test is composed of several elements, and each element is composed of a certain number of operations as indicated in the example in Fig. 2 (March C-). The operations of each element are applied to each cell (for example r1w0, as indicated in the third element of the March C- algorithm) before proceeding to the succeeding one in the address space. When the operations of an element have been applied to all the cells of the memory, the next element is considered. Finally, the arrows indicate the direction of the element execution through the address space (incrementing or decrementing order).
During the read operations, the cell content is compared to the expected (previously written) value. In case a faulty value is detected, the word containing the corrupted cell(s) is transmitted from the FPGA to a computer along with its corresponding address and the timestamp of the event. When the memory is in static mode, the read-back is performed every 30 seconds. Since the FSM is responsible for the sensing of the upsets, only the recorded upsets are transmitted to the computer.
The memories were irradiated under the atmospheric-like neutron beam of the ISIS facilities in the UK [11]. Emitted neutrons were in the energy range of 10-800 MeV and the flux was always of the order of [flux value missing in source]. Experiments were conducted with different devices of the same part and lot number, so that the results would not be related to one memory chip. Table I shows the total number of Single Bit Upsets (SBU), their corresponding Soft Error Rates (SER), the conditions under which each test took place and also the types of MCUs recorded for each test, according to the identification that will follow. Information provided by Table I is only a portion of the large amount of data processed by considering numerous campaigns with different test and setup conditions. The SER was calculated by considering only the SBUs (the MCUs were identified using an in-house tool and considered as single events) and considering the respective atmospheric neutron fluxes given by the facility. Finally, a relative error of 10% should be taken into consideration for the SER based on known systematic errors at this facility.

III. MCU ANALYSIS
This work is based on the data collection from several experimental campaigns while irradiating the commercial SRAMs.

TABLE II
MCU TYPES CROSS SECTION PER BIT
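The text does not spell out how the per-bit cross sections of Table II were computed. As a hedged sketch, the conventional definition divides the number of recorded events by the particle fluence and by the number of bits exposed. The function names and all numeric inputs below are hypothetical placeholders rather than values from the experiments, and 32 Mbit is assumed to mean 32 × 2^20 bits.

```python
def fluence(flux_n_per_cm2_s, exposure_s):
    """Fluence (n/cm^2) accumulated at a constant flux (n/cm^2/s)."""
    return flux_n_per_cm2_s * exposure_s


def cross_section_per_bit(n_events, fluence_n_per_cm2, n_bits):
    """Conventional per-bit cross section in cm^2/bit.

    n_events counts events, so an MCU is counted once no matter how many
    cells it corrupts, in line with treating MCUs as single events.
    """
    if fluence_n_per_cm2 <= 0 or n_bits <= 0:
        raise ValueError("fluence and bit count must be positive")
    return n_events / (fluence_n_per_cm2 * n_bits)


# Placeholder inputs for illustration only (not taken from the paper).
N_BITS = 32 * 2**20                      # assumed meaning of "32 Mbit"
example_fluence = fluence(1.0e5, 7200)   # hypothetical flux, two hours
sigma = cross_section_per_bit(12, example_fluence, N_BITS)
print(f"{sigma:.3e} cm^2/bit")
```

Counting events rather than corrupted cells keeps a single large cluster from dominating the result; a per-cell upset cross section would use the number of flipped cells in the numerator instead.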

Since similar technology is used by most vendors in terms of powering scheme architecture, the same principles presented here can be applied to other memories as well. In order to facilitate the presentation of our results, we categorize the recorded MCUs according to their shapes and sizes. To properly identify the different categories of MCUs, it is essential to be able to physically represent the recorded upsets, on a bitmap for example, and also determine the time of occurrence of each recorded upset.
The experimental setup that we described helped us acquire the time of occurrence of each event, its position in the memory physical map, and also the operation and element of the applied March test during which the upset was sensed. In Fig. 3 an example of the results obtained by applying the “Dynamic Stress” March test is depicted. Each white pixel of the picture represents an “uncorrupted” bit cell while the pixels in black are the recorded upsets. We note that although the words were accessed using the scrambled addresses during the testing, in Fig. 3 the addresses are unscrambled; thus, this is the physical bitmap of the memory cells. Finally, the vertical grey line in the middle of the memory indicates the position of the address decoder while the horizontal black line in the middle of the figure separates the memory in two equal parts. This separation is made because the memory under test is composed of two separate dies. This visual representation provides a global view of the bitmap, with all the accumulated upsets during the irradiation experiments.
The bitmap in Fig. 3 is the result of the DUT being exposed to a flux of the order of [flux value missing in source] for two hours. As Fig. 3 shows, besides the SEUs, different types of clusters of upsets were recorded when the memory was tested with the “Dynamic Stress” March test. Similar results were obtained for all the applied March tests when the memory was in dynamic mode. Conversely, when the memory was tested in static mode, we never observed big clusters of errors. From the results displayed in Fig. 3, and from several other radiation tests, we were able to distinguish four categories of MCUs that will be detailed in the following sub-sections. Information such as the event timestamp and the address scrambling code provided us with the knowledge of the exact location and time of occurrence of each upset, and furthermore its classification. At this point, it is interesting to note that after the occurrence of all the reported types of MCUs, no power cycling was needed, and thus the memory continued operating in normal mode. Considering the fact that no correction schemes were applied, and that even if they were, the number of corrupted cells was too large to achieve effective mitigation, such events raise an important reliability issue. Table II summarizes the obtained cross sections for each type of MCU and also the number of occurrences of each type during the experiments.

Fig. 3. 32 Mbit SRAM bitmap with all the upsets accumulated during one hour and fifty minutes of accelerated atmospheric neutron irradiation. The applied test was the Dynamic Stress and the memory was heated to 50 °C during the exposure. Pixels in black represent the upset cells.

A. Typical MCU: Type A
Type A corresponds to the typical cases of MCUs that have been presented in most of the previous studies [1], [2], and [4]. Such MCUs have been observed throughout all the different tests that we executed (i.e. in both the static and dynamic modes), and it is the only type of MCU that occurs in static mode for the memories that we used. Fig. 4 shows four cases occurring while the memory was tested in static mode.
According to the studies in [1] and [2], such upsets occur when one of the secondary charged particles that is generated from a particle strike traverses sensitive volumes of more than

one cell, inducing in them charge large enough to upset several cells. Another possible source can be that several secondary particles (generated from the same nuclear reaction) travel in different directions, inducing charge in the sensitive nodes of neighboring cells. Both cases of MCUs have been verified in studies at the simulation level [1], [2]. Redundancy or ECC algorithms can be used in order to mitigate such events. In some cases, successful mitigation may not be possible. It always depends on the error detection and correction schemes that exist both at the hardware and software level. To achieve mitigation, hardware designers apply bit interleaving schemes, in which two consecutive bits of a word are not physically consecutive, but have other bit cells interleaved between them. This way, an MCU will corrupt two or more contiguous cells, but these cells will not belong to the same word. Commercial memories are often designed with a distance of 8 to 16 cells between two consecutive bit cells belonging to the same word.

Fig. 4. Type A MCU: typical cases of MCU occurring while the memory was tested in static mode at 50 °C. Similar results have been observed also when the memory was tested in dynamic mode with March algorithms.

B. Rectangular horizontal MCU: Type B
Continuing our analysis, we present the second type of MCU, which occurred in all the memories when tested in dynamic mode, but was never observed when the memory under test was in static mode. Fig. 5 displays some random occurrences of these upsets when the memory was tested with the March C-, the Mats+ and the Dynamic Stress March tests at different temperatures. These events are identified as a result of one incident neutron since all the upset cells share the same timestamp and were sensed under the same March element.

Fig. 5. Type B MCU occurring when tested with: (a) March C- algorithm, 337 cells upset; (b) Dynamic Stress algorithm at 50 °C, 582 cells upset; (c) Mats+ algorithm at 60 °C, 552 cells upset; and (d) Dynamic Stress algorithm at nominal conditions, 261 cells upset. The area affected is usually 16x128 cells.

This type of MCU has a rectangular horizontal shape and appears rather frequently. Considering the number of affected cells, they have to be taken into consideration when evaluating the sensitivity of the studied devices under neutron radiation. The major problem of Type B MCUs is the ineffectiveness of common ECC algorithms at correcting more than two upset bits belonging to the same word. From measurements performed during our experiments, the number of corrupted bit-cells is in the range of 100-700. From Fig. 5, it is clear that, due to the extension of the MCU, more than two bit cells belonging to the same word are corrupted, making the bit interleaving scheme ineffective.
Such events are different from Type A MCUs, since the collision of a single particle cannot induce a charge that can directly affect such a large number of cells. MCUs similar to these have been reported in [5] as micro-latchups but forming different shapes. The difference in the error map patterns observed in this work and those of [5] is possibly due to differences in the architectures of the SRAMs under investigation. Specifically, the p- and n-well taps, which electrically isolate the small regions of the memory in which micro-latchups are localized, are positioned differently in each SRAM. The well taps mitigate errors by reducing the resistance between the particle strike position and the collection node, and consequently reducing the voltage drop for the triggering of the parasitic bipolar structure that is responsible for the latchup [15]. At the same time, the powering scheme of these electrically isolated blocks localizes the latchup to small regions that are fully powered only when an operation occurs. According to the technology of the memory under test, when an electrically isolated block (defined by the taps) is in retention mode it is not fully powered: the level of supply voltage is lowered so that the data retention is ensured, keeping the power consumption low at the same time. This low power scheme prevents the occurrence of latchups in all non-selected blocks according to [16], since holding voltages below [value missing in source] V would not trigger parasitic bipolar structures. Once an operation is applied to a cell, the block (defined by the well taps) to which it belongs is fully powered for one operation cycle. As reported in [5], if a latchup occurs in that block during the operation, the lowering of the voltage levels in the other blocks and the well tapping scheme will topologically limit the latchup (micro-latchup), which would endure one operation cycle at maximum. Thus, the device will not have to be power cycled after the micro-latchup occurrence.
The fact that the memory in our experiments uses such an architectural scheme explains the shapes and sizes of Type B MCUs. These micro-latchups do not propagate to the rest of the device and they are observed only during dynamic mode testing and not static (when all blocks are in low power mode). According to the vendor, the well taps that define the electric block are located every 16 cells vertically and 128 cells horizontally, and they help block the transmission of the latchup. In a typical occurrence of a Type B upset and according to the positioning of the p-well taps that are connected to [symbol missing in source] and the n-well taps that are connected to the ground, the latchup will propagate from the upper part (p-well) towards the bottom (n-well), and thus we will have a relatively uniform distribution of the corrupted cells as seen for example in Fig. 5(b). However, there are cases where the latchup propagates horizontally, forming the “bird tail” shapes that we can observe in Fig. 5(d).
For each electric block, in the middle of the horizontal axis, there is a gap in the p-well that also works against the horizontal propagation of the latchup, as indicated in Fig. 6. The p-well gap in the middle of the electric block reduces the propagation of the latchup, and when it spreads, it is often significantly weakened, as shown in Fig. 6.

Fig. 6. Mats+ algorithm: (a) Type B MCU affecting one side of the electric block and successfully stopped by the p-well gap; (b), (c) Type B MCU expanding towards a neighboring block but weakened. The p-well gap that reduces the latchup propagation between two well taps is also demonstrated.

As can be seen either in Fig. 5(a, d) or in Fig. 6, the micro-latchup is not always uniformly distributed inside the electric block. This is related to the time of occurrence of the latchup with respect to the applied operation. According to the described low power cycling mechanism, the latchup will propagate during the cycle of the operation in the selected (powered up) block before it returns to low power mode. Depending on the time of occurrence of the latchup during the operation cycle, the latchup will have more or less time to expand through the electric block defined by the taps. In Fig. 7, a timing diagram of read and write operations is depicted. For the read operation, for example, a certain time is required for the bit lines to return to their normal state, as shown in the scheme. It is during the entire duration of the cycle that the micro-latchup can propagate through the electric block and not only during the operation execution. Considering that the transient current induced by the particle strike has a very limited duration (a few hundred picoseconds) with respect to the overall operation cycle (a few tens of nanoseconds), the latchup has sufficient time to be transmitted before the power down of the block.

Fig. 7. Schematic of the cycle of operation of an SRAM. It is shown that the cycle does not cover only the time of the operation execution, but additional time is required for the restoring of the control signals to their normal state. During the entire duration of the cycle, the latchup propagation is probable.

As seen in Fig. 5(d), the affected cells are not uniformly distributed and also the density of the corrupted cells (transition from the meta-stability state to a stable one) is not always the same. Furthermore, in Fig. 6 the transmission of the latchup towards an electric block is seen in different lengths (Fig. 6(a) being the one with the smallest duration). When a latchup occurs, the SRAM cells enter a meta-stability state, from which they exit towards a stable one (storing either ‘1’ or ‘0’) when the latchup condition is finished. Since the pattern written to the memory cells is always the same (either all ‘1’, or ‘0’ depending on the March element), we expect an approximate distribution of 50% of corrupted cells. The distribution of the cells storing either ‘1’ or ‘0’ depends on two random factors: the noise on the power supply and the local Random Dopant Fluctuations (RDF) that make the cells not perfectly symmetrical. If we define the sensitive area of a Type B upset as the area limited by a rectangle, shaped by the minimum and maximum coordinates of the upsets belonging to the same event, we observe that in the majority of recorded Type B MCUs, the percentage of corrupted cells in the clusters was of the order of 20-40%. Of course, such a definition of the sensitive area does not always reveal the realistic distribution of corrupted cells, since the latchup does not always propagate uniformly (as for example in the bird tail shape cases) in the full area of the electric block.
We have to note that the well tap and low power scheme seems to be very efficient in reducing latchup propagation, since it has been very rare that a Type B upset expanded towards a second electric block. In all of the March tests that we applied, most of the time there is a common sequence of operations in the elements. They are either standalone write operations, as in the case of the first element of the March C- algorithm, or a read operation followed by a write operation, as are most of the elements of the March C- algorithm. In all of the cases, this write operation overwrites the Type B MCU and thus there is no multi-error detection for the same event.

C. Rectangular horizontal MCU: Type C
The third type of recorded MCUs, referred to as Type C, affects a larger area with respect to Type B. To the best of our knowledge, Type C events have not been reported so far in the literature. Fig. 8 displays the events as recorded in two different devices tested under different algorithms.

Fig. 8. Type C MCU: (a), (b) Dynamic Stress algorithm at 50 °C, 32112 cells upset; (c), (d) March C- algorithm at room temperature, 46504 cells upset.

Type C clusters are significantly larger than Type B, involving 30k up to 50k upset cells. They have a horizontal extension of

upsets that covers the memory from one edge to the other as seen in Fig. 8(c,d), or in some cases, as seen in Fig. 8(a,b), from the middle of the memory to one edge, which corresponds to the Type C MCU of Fig. 3. In addition, its vertical extension covers approximately 20-60 bit cells (i.e. vertical lines). It appears that for all the presented cases, there is a repeated sequence, each time a different one as shown in Fig. 9, which is a zoom in of the sequences presented in Fig. 8. Since the recorded sequence was different each time that it appeared, we do not have enough statistical data to correlate the error sequence with the applied test. It should be mentioned that the sequences of Fig. 8(a) and (b) (or Fig. 9(a), (b) being zoomed in) were recorded concurrently, as well as the ones of Fig. 8(c) and (d) (or Fig. 9(c), (d) being zoomed in), although they affect cells located in different positions of the bitmap, meaning that these upsets belong to the same event. The difference between these two cases is their size, shape and placement in the memory, as can be seen, for example, in the global memory map of Fig. 3.

Fig. 9. Type C MCU zoom in: (a), (b) are the zoomed in patterns of Fig. 8(a,b) while (c), (d) are the zoomed in patterns of Fig. 8(c, d).

A first thing to be noted is that Type C events were recorded in different devices and also different irradiation campaigns (one year gap between them), meaning that these upsets are not related to a defective device. Another important observation is that for all the cases, the corrupted bits of the words were all set to ‘0’ instead of being set to ‘1’, i.e. a complete “word flipping” was observed. Also, although the March algorithms were different, the element and operation that sensed both recorded events were the same, i.e. the event was sensed during the first operation of the 5th element for both the Dynamic Stress and the March C-. The operation that sensed the Type C event was a read ‘1’ followed by a write zero, but since the number of Type C MCUs that we observed is very small, we cannot draw conclusions about their correlation to the March element.
In processing data forming Type C events, we observed that the upset words belong to consecutive scrambled addresses. While the address input to the memory (scrambled) is a 22 bit number, which we either increment or decrement depending on the March element (arrow direction in front of each element), the un-scrambled address is a combination of two numbers forming the X and Y axis of the memory (row and column). As already described, when we apply each operation, we consider the scrambled addresses, and not the X and Y physical location of the words. Consequently, although we observe in Type C events a very specific pattern, this pattern is nothing more than the translation to the physical map of a solid cluster of consecutive scrambled addresses.
Considering the interpretation of the data, we can describe this event as a Single Event Functional Interrupt (SEFI) occurring in the memory periphery. It is rather clear that a single particle cannot induce such a symmetric pattern, either by generating multiple secondary particles as a result of an inelastic reaction or as a result of a micro-latchup. Exploring the several possible causes for such an event, one of the first candidates for failure is the address decoder. If an upset occurred in the address decoder input buffer for example, in a way that one or more bit(s) of the address was stuck at a certain value for a time span big enough to induce such an upset, it would be a possible explanation. In this scenario, we should consider the case study in which a March element with r0w1 operations is followed by the element r1w0. If the element r0w1 is applied to a wrong address because of a stuck bit (we consider that it will cover a large range of cells such as those shown in Fig. 8), it means that in one region of the memory this element will not be applied, and, at the same time, in another region, it will be erroneously applied. For example, the faulty execution of the element r0w1 comes from the fact that this element will be applied to a region where the same March element was already applied, and thus the region will have a ‘1’ stored in its cells, while the element’s first operation is a r0. In this case, this action will leave another region of the memory without having the March element (r0w1) applied. At the next March element however (r1w0) this other region will have a ‘0’ stored at the cells, and it will contain the exact same number of cells that were faulty on the previous March element. As a result, the exact same number of failing bit-cells should be seen twice, in two different regions of the memory and as a result of two different failing March elements. However, this is not the case that we observe in Fig. 8 and Fig. 9.
Continuing the search of possible explanations for the occurrence of Type C MCUs, we may consider that an error would occur in the input/output data buffers. This is the most probable scenario and it can be the result of two types of failures. The first type is the failure of the data buffer to provide the correct input data for the write operation. Consequently, instead of writing a ‘1’, a ‘0’ is written to the cells, as in the case of a [notation missing in source] element. This would result in a failing write operation and the cells will store the wrong content. The next read operation of the following March element will sense this failure. The second case represents the failure of the read circuitry to store the correct data in the data buffer. For example, in a [notation missing in source] operation, there is a failure of the read ‘1’ operation from the cells to set the data buffer at ‘1’, and a ‘0’ is returned. In this case, we have read operations failing. Nevertheless, the cells themselves are not corrupted. At the system level, we cannot predict how such an event would be translated and what will be the impact of the failure (depending on the ECC schemes, failure schemes, etc). We cannot claim with certainty if the recorded Type C MCUs were a result of failing read or failing write operations, since the result is the same for both cases. To the best of our knowledge, although SEFIs have been reported in the past (probably similar to Type C MCUs), they have not been mapped in the memory
TSILIGIANNIS et al.: MULTIPLE CELL UPSET CLASSIFICATION IN COMMERCIAL SRAMS 1753

Fig. 10. Type D MCU: (a) Dynamic Stress at 50 –38246 cells upset (b) March
C- algorithm–89934 cells upset, (c) Mats + at 60 –23964 cells upset (d) Dy-
namic Stress nominal–41022 cells upset.

array and presented such that to have sufficient data to compare


with our results.
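The claim that a failing write and a failing read leave the same trace in the test log can be illustrated with a small simulation. The sketch below is not the authors' test setup. It runs March C- (using the commonly cited six-element definition) on a hypothetical 64-word, 1-bit-wide memory model, and injects either a write-buffer fault (a ‘0’ is written instead of a ‘1’) or a read-buffer fault (a ‘0’ is returned instead of a ‘1’) over an arbitrary window of consecutive addresses. The names `run_march`, `fault` and `window`, as well as the memory size, are assumptions made for illustration only.

```python
# March C- as (direction, operations); "r1" = read expecting 1, "w0" = write 0.
# Elements marked "either direction" are run ascending here.
MARCH_C_MINUS = [
    (+1, ["w0"]),
    (+1, ["r0", "w1"]),
    (+1, ["r1", "w0"]),
    (-1, ["r0", "w1"]),
    (-1, ["r1", "w0"]),
    (+1, ["r0"]),
]


def run_march(n_words, fault=None):
    """Run March C- on an n_words x 1-bit memory model.

    fault is None or a tuple (kind, element, window): kind is "write" or
    "read", element is the 1-based March element during which the data
    buffer misbehaves, and window is a range of affected addresses.
    Returns the sorted list of (element, operation, address) read failures.
    """
    mem = [0] * n_words
    failures = []
    for elem_no, (direction, ops) in enumerate(MARCH_C_MINUS, start=1):
        if direction > 0:
            addresses = range(n_words)
        else:
            addresses = range(n_words - 1, -1, -1)
        for addr in addresses:
            hit = fault is not None and fault[1] == elem_no and addr in fault[2]
            for op_no, op in enumerate(ops, start=1):
                value = int(op[1])
                if op[0] == "w":
                    # Write-buffer fault: a 0 reaches the cell whatever the data.
                    mem[addr] = 0 if (hit and fault[0] == "write") else value
                else:
                    # Read-buffer fault: a 0 is returned, the cell is untouched.
                    got = 0 if (hit and fault[0] == "read") else mem[addr]
                    if got != value:
                        failures.append((elem_no, op_no, addr))
    return sorted(failures)


window = range(20, 28)  # hypothetical run of consecutive addresses
write_log = run_march(64, ("write", 4, window))  # bad w1 in the 4th element
read_log = run_march(64, ("read", 5, window))    # bad r1 in the 5th element
print(write_log == read_log)
print(write_log)
```

In this model both faults are reported at the first operation of the 5th element for the same addresses, so the log alone cannot separate them, which is consistent with the observation above.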
It is important to note that only a few such events have been observed among the tests that we applied, indicating that their frequency of occurrence is rather small. However, they should be taken into account when evaluating the memory’s sensitivity, since the number of cells affected is rather large (30k up to 50k bit-cells). If we assume that Type C events are associated with functional interrupts, it is not meaningful to provide a percentage of corrupted cells with respect to the overall sensitive surface as for the Type B MCUs, since we cannot define the potential surface.

D. Rectangular vertical MCU: Type D

Type D MCUs were observed with all the different test algorithms applied in dynamic mode. Fig. 10 shows a few examples observed through the different test runs, as they were isolated from the rest of the occurring upsets. Such MCUs can cover a large part of a memory region, as can be seen in Fig. 10(b) (89944 corrupted cells), or smaller but still significant parts, as in Fig. 10(c) (23964 corrupted cells). Type D events usually involve 20,000 to 90,000 corrupted cells, inside an area of 64x2048 cells in a vertically positioned rectangular cluster, according to our experimental results. So far, we have observed that Type D events are always limited to such a defined area. To the best of our knowledge, such events have never been reported before in the literature.

Although Type D MCUs resemble Type B events, they cannot be the result of micro-latchups that are localized in failing areas of a larger scale. As observed in Fig. 10, there are regions of 64x64 cells without any upset cells, making these “white spots” clear uncorrupted areas in the middle of a big cluster of upsets. If Type D upsets were the result of a micro-latchup in the memory array, a contiguous area of corrupted cells would be expected, without the observed discontinuity areas. Type D events cannot be correlated to Type C either, since there is no repeated pattern of upset cells. It should be noted that these Type D MCUs always appear at the vertical edges of the memory array (left and/or right). This fact may possibly be explained by the concurrence of micro-latchups in these peripheral locations and major delays of the controlling signals (such as word line selection), which are larger in locations that are far from the address decoder. The discontinuity regions have been observed in different devices and experiments, thus they cannot be correlated to a single device. Also, in the different cases of Fig. 10 there are regions of cells in which the faulty cells are less dense than in the rest of the corrupted area. A possible explanation is the existence of separate power lines combined with current limiters for these regions that are completely independent from the rest of the corrupted area. Thus, a latchup, for example, in the power switches of the memory would leave some of the power lines unaffected (i.e. they would not suffer from current drops), explaining the white discontinuity areas. Also, as in Type B, the duration of the event (whether it is a latchup or a Single Event Transient (SET) inducing the error) affects its propagation.

Fig. 11. Type D MCU zoom in: (a) part of a Type D upset during March C- in 110, (b) part of a Type D upset during Mats+ in 80, (c) part of a Type D upset during Mats+ in 90, (d) and (e) are part of Type D with a March C- algorithm in nominal conditions.

By taking a closer look at some of the observed Type D events, we observe that the corrupted cells are not distributed as they are for Type B or Type C events. Observing Type D upsets from a macroscopic point of view, it seems that all these upsets form dense regions of corrupted cells, but Fig. 11 provides a closer insight into the different cases. This zoomed-in view of different Type D MCUs allows us to make additional hypotheses about the nature of the Type D MCU. In the case of Fig. 11(a) or Fig. 11(c), the distribution of corrupted cells can be considered the result of a typical latchup. However, in Fig. 11(b) and (d), the density of corrupted cells is much higher. This can be the result of a latchup occurring at the periphery of the memory, and more specifically at the output buffers. As a consequence, corrupted data will be transmitted at the output of the memory for a period longer than one operation cycle, since the low powering scheme does not apply to this part of the memory. In this case, the white discontinuity spots are due to the scrambling of the address (scrambled addresses are used during the experiments and are then decoded during data processing).

Exploring the different sources of occurrence of Type D events, another possible explanation could come from a localized latchup in the region of the synchronization circuitry. In order to properly perform the read operation, the word line signal must be activated for a certain amount of time to connect the selected cell with its bit lines, which have been previously pre-charged.
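The dependence of a correct read on the word line activation time can be sketched with a first-order estimate. The model below is an assumption, not taken from the paper: it treats the accessed cell as a constant current source discharging a capacitive bit line, so that the differential voltage grows linearly with the activation time. The cell current, bit line capacitance and sense amplifier threshold are arbitrary placeholder values chosen only to make the example run, and the function names are hypothetical.

```python
def bitline_differential(t_wordline_s, i_cell_a=20e-6, c_bitline_f=200e-15):
    """Differential voltage (V) after the word line has been active for
    t_wordline_s seconds, assuming a constant cell current discharging a
    purely capacitive bit line (dV = I * t / C). Placeholder defaults."""
    return i_cell_a * t_wordline_s / c_bitline_f


def read_is_reliable(t_wordline_s, v_sense_min_v,
                     i_cell_a=20e-6, c_bitline_f=200e-15):
    """True if the differential voltage reaches the (assumed) minimum the
    sense amplifier needs to resolve the stored value."""
    dv = bitline_differential(t_wordline_s, i_cell_a, c_bitline_f)
    return dv >= v_sense_min_v


# Hypothetical activation times: a nominal pulse and one cut short by a
# disturbance in the synchronization circuitry.
for label, t in [("nominal", 2e-9), ("shortened", 0.5e-9)]:
    dv = bitline_differential(t)
    ok = read_is_reliable(t, v_sense_min_v=0.1)
    print(f"{label}: dV = {dv * 1e3:.0f} mV, reliable = {ok}")
```

With these placeholder numbers the shortened pulse leaves the differential voltage below the assumed threshold, which corresponds to the situation in which the sense amplifier output is no longer guaranteed.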
Fig. 12. Schematic view of the circuits involved in a read operation: SRAM cell, Bit Lines, Word Line, Sense Amplifier (SA) and the Output Buffer. The differential voltage of the two bit lines is detected by the sense amplifier, which provides the final output to the buffer.

During the read operation, the node at ‘0’ of the cell (left node in Fig. 12) partially discharges the bit line to which it is connected. The voltage drop of this bit line depends on how long the word line signal is activated. Generally, the read access has a duration that allows the differential voltage between the two bit lines to reach a level that is commonly considered sufficient to make the sense amplifier work properly and give the correct output.

If the synchronization circuitry does not work properly, this access time can be reduced so that the differential voltage is lower than expected and the sense amplifier provides a random value as output. Thus, if the synchronization circuitry undergoes a short-time event, like an SET, one or more operations may fail. On the other hand, an event like a micro-latchup in the periphery of the memory chip (which does not follow the power down and up cycling) can last for many cycles and would produce many consecutive failing operations that can appear as a cluster of errors. In conclusion, since the number of cells affected by a Type D MCU is rather large and the event cross section is important for such a big area of corrupted cells, memories that are meant to be used in high-reliability applications should be extensively tested to reveal this type of event.

IV. CONCLUSIONS

Besides the Type A MCU, in which a few cells are corrupted, the rest of the reviewed cases impact a large number of cells, a matter that can be of great importance at the system level. We showed that Type B, C and D MCUs were observed only when the memories were tested in dynamic mode. Thus, dynamic mode testing revealed phenomena that would never have been observed during static mode testing. Moreover, dynamic mode testing takes into account a more realistic scenario of the memory operation, since in real systems the memory does not remain only in retention mode. However, additional experiments that explore the dependence of such events on the frequency should be performed in the future. It seems that the architecture of the memory does not allow latchups to propagate to the entire memory, but limits them to small areas and to one memory cycle duration. For this reason, the device does not need to be power cycled when a latchup occurs. Through this work we revealed some of the issues that may occur during the dynamic mode operation of this SRAM. Even though this memory uses some very specific powering schemes, or layout schemes, it is very probable that memories of other vendors utilize similar architectural techniques. Although static mode testing is usually considered the standard for neutron testing, for high-reliability systems the presented events resulting from dynamic mode testing may potentially occur, and thus dynamic testing should also be taken into account. Type C cross sections for Mats+ and Dynamic Stress tests were found to be zero since we did not have any experimental data. If the experimental time were longer, it would be possible to observe Type C events, but we expect that the cross section would remain low.
